Robust feature matching in 2.3
Robust-forensic-...

Robust forensic matching of confiscated horns to individual poached African rhinoceros

Cindy Harper 1,2,*, Anette Ludwig 1, Amy Clarke 1, Kagiso Makgopela 1, Andrey Yurchenko 2, Alan Guthrie 1, Pavel Dobrynin 2, Gaik Tamazian 2, Richard Emslie 3, Marile van Heerden 4, Markus Hofmeyr 1,5, Roderick Potter 6, Johannes Roets 7, Piet Beytell 8, Moses Otiende 9, Linus Kariuki 9, Raoul du Toit 10, Natasha Anderson 10, Joseph Okori 11, Alexey Antonik 2, Klaus-Peter Koepfli 2,12, Peter Thompson 1, and Stephen J. O'Brien 2,13

Black and white rhinoceros (Diceros bicornis and Ceratotherium simum) are iconic African species that are classified by the International Union for the Conservation of Nature (IUCN) as Critically Endangered and Near Threatened, respectively [1]. At the end of the 19th century, Southern white rhinoceros (Ceratotherium simum simum) numbers had declined to fewer than 50 animals in the Hluhluwe-iMfolozi region of the KwaZulu-Natal (KZN) province of South Africa, mainly due to uncontrolled hunting [2,3]. Efforts by the Natal Parks Board facilitated an increase in the population to over 20,000 in 2015 through aggressive conservation management [2]. Black rhinoceros (Diceros bicornis) populations declined from several hundred thousand in the early 19th century to ~65,000 in 1970 and to ~2,400 by 1995 [1], with subsequent genetic reduction, also due to hunting, land clearances and later poaching [4]. In South Africa, rhinoceros poaching incidents increased from 13 in 2007 to 1,215 in 2014 [1]. This has occurred despite strict trade bans on rhinoceros products and strict enforcement in recent years.

The significant increase in illegal killing of African rhinoceros and the involvement of transnational organised criminal syndicates in horn trafficking have been met with increased law enforcement efforts to apprehend, successfully prosecute and sentence traffickers and poachers, with the aim of reducing poaching. In Africa, wildlife rangers, law enforcement officials and genome scientists have instituted a DNA-based individual identification protocol using composite short tandem repeat (STR) genotyping of rhinoceros horns, rhinoceros tissue products and crime scene carcasses to link confiscated evidence to specific poaching incidents in support of criminal investigations. This method has been used extensively and documented in the RhODIS® (Rhinoceros DNA Index System) database of confiscated horn and living rhinoceros genotypes (http://rhodis.co.za), the eRhODIS™ applications used to collect field and forensic sample data, and RhODIS® biospecimen collection kits. These are made available to trained, RhODIS®-certified officials to fulfill chain-of-custody requirements, providing a pipeline to connect illegally trafficked rhinoceros products to individual poached rhinoceros victims. This study applies a panel of 23 STR (microsatellite) loci to genotype 3,968 individual rhinoceros DNA specimens from distinct white and black rhinoceros populations [5].
We assessed the population genetic structure of these (Supplemental information) and applied them to forensic match analyses of specific DNA profiles in more than 120 criminal cases to date. Four methods were applied to support forensic matching of confiscated tissue evidence to crime scenes: first, further characterization and optimization of STR panels informative for rhinoceros species; second, development and application of the RhODIS® database containing genotypes and demographic information of more than 20,000 rhinoceros acquisitions; third, analysis of the population genetic structure of white and black rhinoceros species, subspecies and structured populations; and fourth, computation of match probability statistics for specific profiles derived from white and black rhinoceroses. We established a reference database consisting of 3,085 genotypes of white rhinoceros (C. simum) and 883 black rhinoceros (D. bicornis) sampled since 2010, which provide the basis for robust match probability statistics.

The effects of historic range contractions or expansions, migration, translocation and population fragmentation caused by poaching and habitat reduction on rhinoceros population genetic structure have been reported but are limited [6-8]. Southern white rhinoceros are traditionally considered panmictic and comprising a single subspecies, C. s. simum, as a result of the severe founder effect in the late 19th century [2]. Black rhinoceros are generally subdivided into three modern subspecies, D. b. bicornis, D. b. michaeli and D. b. minor [8]. Population structure of white and black rhinoceros based upon three different analyses (Supplemental information) affirmed the partition of white versus black rhinoceros species plus the separation of the three black rhinoceros subspecies. The STRUCTURE algorithm revealed a fine-grain distinctiveness between black rhinoceros D. b. minor populations from Zimbabwe and KwaZulu-Natal (KZN), South Africa, and also indicates that black rhinoceros in the Kruger National Park (KNP) comprise a mix of KZN and Zimbabwe rhinoceros, as expected, since KNP black rhinoceros founders originated from these two locales [9].

For forensic match applications, we calculated allele frequencies for all polymorphic unlinked loci for white (3,085 genotypes) and black rhinoceros (883 genotypes). These estimates and other STR locus statistics were calculated for each rhinoceros species. Population differentiation (FST) between white and black rhinoceros subspecies supports the recognition of the Southern white rhinoceros subspecies (C. s. simum) and three black rhinoceros subspecies, D. b. bicornis, D. b. michaeli and D. b. minor, with significant partitioning of the Zimbabwe versus KZN D. b. minor populations in the present African rhinoceros populations.

Over 5,800 rhinoceros crime cases have been submitted to RhODIS® since 2010, and in excess of 120 case reports relating carcass material to evidence items (horn, tissue, blood stains and other confiscated materials) have been provided to investigators. Table 1 summarizes nine of these rhinoceros crime cases which have been concluded in court. These illustrate where DNA matches were made and the use of this evidence for the prosecution, conviction and sentencing of perpetrators of rhinoceros crimes. Table 1 includes case sample details, species identified and match probability calculated using the RhODIS® reference database.
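The correspondence reports cumulative match probabilities computed from unlinked STR allele frequencies with a conservative theta of 0.1 (Table 1) but does not spell out the formula. The sketch below is a hedged illustration using the standard theta-corrected (Balding-Nichols / NRC II style) genotype frequencies that are commonly used in forensic genetics; the locus names, allele labels and frequencies are invented for illustration and are not taken from the RhODIS® database.

```python
# Hedged sketch: cumulative STR match probability with a theta correction.
# The formulas are the standard theta-corrected genotype frequencies used in
# forensic genetics, not necessarily the exact RhODIS procedure; all data
# below are made up for illustration.

def genotype_probability(allele_a, allele_b, locus_freqs, theta=0.1):
    """Theta-corrected genotype frequency for one STR locus."""
    p_a = locus_freqs[allele_a]
    p_b = locus_freqs[allele_b]
    denom = (1.0 + theta) * (1.0 + 2.0 * theta)
    if allele_a == allele_b:  # homozygote
        return (2*theta + (1 - theta)*p_a) * (3*theta + (1 - theta)*p_a) / denom
    # heterozygote
    return 2.0 * (theta + (1 - theta)*p_a) * (theta + (1 - theta)*p_b) / denom

def cumulative_match_probability(profile, freqs, theta=0.1):
    """Product of per-locus genotype probabilities over unlinked loci."""
    cmp_value = 1.0
    for locus, (a, b) in profile.items():
        cmp_value *= genotype_probability(a, b, freqs[locus], theta)
    return cmp_value

# Illustrative two-locus example (hypothetical loci, alleles and frequencies):
freqs = {"LOC1": {"12": 0.18, "14": 0.07}, "LOC2": {"9": 0.30}}
profile = {"LOC1": ("12", "14"), "LOC2": ("9", "9")}
print(cumulative_match_probability(profile, freqs, theta=0.1))
```

Multiplying across all 23 loci in this way is what produces the very small cumulative match probabilities quoted in the case summaries.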
The successful prosecution, conviction and sentencing of suspects in South Africa and other countries affirm the utility of the RhODIS® approach in criminal prosecutions of the perpetrators of the illegal rhinoceros trade and provide an international legal precedent for the prosecution of rhinoceros crimes using robust forensic matching of confiscated evidence items to specific wildlife crime scenes.

SUPPLEMENTAL INFORMATION

Supplemental Information including experimental procedures, one figure and one table can be found with this article online at https://doi.org/10.1016/j.cub.2017.11.005.

Table 1. Summary of nine prosecuted cases of rhinoceros crime. Samples were successfully matched using composite STR genotyping, with cumulative match probability calculated using a conservative theta (θ) of 0.1. Details of each case with matching evidence items, location of the poaching incident, species and subspecies identified, cumulative match probability, status of the case (conviction date: sentence) and the nationalities of the accused are provided for six South African cases and single cases from Kenya, Namibia and Singapore. (KNP - Kruger National Park, SA - South Africa, ORTIA - OR Tambo International Airport, HiP - Hluhluwe-iMfolozi Park, OPC - Ol Pejeta Conservancy, ENP - Etosha National Park.) a and b refer to match probability calculations for specific white and black rhinoceros summarised in Supplemental information.

REFERENCES

1. Emslie, R.H., Milliken, T., Talukdar, B., Ellis, S., Adcock, K., and Knight, M.H. (2016). African and Asian Rhinoceroses - Status, Conservation and Trade. In A Report from the IUCN Species Survival Commission (IUCN SSC) African and Asian Rhino Specialist Groups and TRAFFIC to the CITES Secretariat pursuant to Resolution Conf. 9.14 (Rev. CoP15).
2. Player, I. (2013). The White Rhino Saga (Johannesburg: Jonathan Ball Publishers).
3. Walker, C., and Walker, A. (2012). The Rhino Keepers (Johannesburg: Jacana Media).
4. Milliken, T., and Shaw, J. (2012). The South Africa - Viet Nam Rhino Horn Trade Nexus: A deadly combination of institutional lapses, corrupt wildlife industry professionals and Asian crime syndicates. TRAFFIC, Johannesburg, South Africa.
5. Harper, C.K., Vermeulen, G.J., Clarke, A.B., De Wet, J.I., and Guthrie, A.J. (2013). Extraction of nuclear DNA from rhinoceros horn and characterization of DNA profiling systems for white (Ceratotherium simum) and black (Diceros bicornis) rhinoceros. Forensic Sci. Int. Genet. 7, 428-433.
6. Anderson-Lederer, R.M., Linklater, W.L., and Ritchie, P.A. (2012). Limited mitochondrial DNA variation within South Africa's black rhino (Diceros bicornis minor) population and implications for management. Afr. J. Ecol. 50, 404-413.
7. Kotzé, A., Dalton, D.L., Du Toit, R., Anderson, N., and Moodley, Y. (2014). Genetic structure of the black rhinoceros (Diceros bicornis) in south-eastern Africa. Conserv. Genet. 15, 1479-1489.
8. Moodley, Y., Russo, I.R.M., Dalton, D.L., Kotzé, A., Muya, S., Haubensak, P., Bálint, B., Munimanda, G.K., Deimel, C., Setzer, A., et al. (2017). Extinctions, genetic erosion and conservation options for the black rhinoceros (Diceros bicornis). Sci. Rep. 7, 41417.
9. Hall-Martin, A. (1988). Conservation of the black rhino: the strategy of the National Parks Board of South Africa. Quagga 1, 12-17.

1 Faculty of Veterinary Science, University of Pretoria, Onderstepoort 0110, South Africa. 2 Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg, Russia 199004.
3 IUCN SSC African Rhino Specialist Group, Hilton 3245, South Africa. 4 National Prosecuting Authority, Silverton 0184, South Africa. 5 Veterinary Wildlife Services, South African National Parks, Skukuza, South Africa. 6 Ezemvelo KZN Wildlife, Queen Elizabeth Park, Pietermaritzburg 3201, South Africa. 7 South African Police Service, Stock Theft and Endangered Species Unit, Pretoria 0001, South Africa. 8 Ministry of Environment and Tourism, Windhoek, Namibia. 9 Kenya Wildlife Service, Nairobi 00100, Kenya. 10 Lowveld Rhino Trust, Harare, Zimbabwe. 11 WWF: African Rhino Programme, Cape Town, South Africa. 12 Smithsonian Conservation Biology Institute, 3001 Connecticut Ave NW, Washington, DC 20008, USA. 13 Guy Harvey Oceanographic Center, Nova Southeastern University, 8000 North Ocean Drive, Ft Lauderdale, FL 33004, USA.
*E-mail: ******************.za
Image Mosaic Algorithms and Implementation (Part 1)

Image Mosaic Algorithms and Implementation (Part 1). Keywords: image mosaic, image registration, image fusion, panorama. Abstract: Image mosaic technology spatially aligns a set of image sequences with mutually overlapping regions and, after resampling and compositing, produces a single complete, high-resolution, wide-field-of-view image that contains the information of the whole sequence.
Image mosaicking has broad application value in photogrammetry, computer vision, remote sensing image processing, medical image analysis, computer graphics and other fields.
In general, the image mosaic process consists of three steps: image acquisition, image registration and image compositing, of which image registration is the foundation of the whole mosaic.
This work studies two image registration algorithms: feature-based registration and transform-domain registration.
Building on feature-based registration, a robust feature-point-based registration algorithm is proposed.
First, the Harris corner detection algorithm is improved, effectively increasing the speed and accuracy of the extracted feature points.
Then, using the similarity measure NCC (normalized cross-correlation), initial feature point pairs are extracted by bidirectional maximum-correlation-coefficient matching, and the random sampling method RANSAC (Random Sample Consensus) is used to remove false pairs, achieving precise matching of the feature point pairs.
Finally, the correctly matched feature point pairs are used to register the images.
The proposed algorithm is highly adaptable and can still register images accurately in cases that are difficult to match automatically, such as repetitive textures and large rotation angles.
Abstract: Image mosaicking is a technique that spatially aligns a series of images that overlap one another, and finally builds a seamless, high-quality image with high resolution and a wide field of view. Image mosaicking has wide applications in the fields of photogrammetry, computer vision, remote sensing image processing, medical image analysis, computer graphics and so on.
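The pipeline summarised in the abstract (Harris corners, NCC similarity, bidirectional best-match pairing, RANSAC rejection of false pairs) can be sketched in a few functions. The sketch below uses OpenCV and NumPy; the patch size, NCC threshold and the choice of a homography as the RANSAC model are assumptions made for illustration, not values taken from the paper.

```python
# Hedged sketch of the described matching pipeline: Harris corners -> NCC ->
# bidirectional best-match pairing -> RANSAC outlier rejection.
import cv2
import numpy as np

def harris_points(gray, max_pts=400):
    # goodFeaturesToTrack with the Harris response as the corner measure
    pts = cv2.goodFeaturesToTrack(gray, maxCorners=max_pts, qualityLevel=0.01,
                                  minDistance=8, useHarrisDetector=True, k=0.04)
    return [] if pts is None else [tuple(p.ravel()) for p in pts]

def patch(gray, pt, half=7):
    x, y = int(round(pt[0])), int(round(pt[1]))
    h, w = gray.shape
    if x - half < 0 or y - half < 0 or x + half >= w or y + half >= h:
        return None
    return gray[y - half:y + half + 1, x - half:x + half + 1]

def ncc(a, b):
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0 else float((a * b).sum() / denom)

def bidirectional_ncc_matches(g1, g2, pts1, pts2, thresh=0.8):
    p1 = [patch(g1, p) for p in pts1]
    p2 = [patch(g2, p) for p in pts2]
    score = np.full((len(pts1), len(pts2)), -1.0)
    for i, a in enumerate(p1):
        if a is None:
            continue
        for j, b in enumerate(p2):
            if b is not None:
                score[i, j] = ncc(a, b)
    matches = []
    for i in range(len(pts1)):
        j = int(np.argmax(score[i]))
        # keep the pair only if i is also the best match for j (bidirectional check)
        if score[i, j] >= thresh and int(np.argmax(score[:, j])) == i:
            matches.append((pts1[i], pts2[j]))
    return matches

def register(img1, img2):
    g1, g2 = (cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) for im in (img1, img2))
    matches = bidirectional_ncc_matches(g1, g2, harris_points(g1), harris_points(g2))
    src = np.float32([m[0] for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([m[1] for m in matches]).reshape(-1, 1, 2)
    # RANSAC removes remaining false pairs while fitting the mapping
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H, inliers
```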
Introduction to Face Detection and Recognition (IntroFaceDetectRecognition)

Knowledge-based Methods: Summary
Pros:
Easy to come up with simple rules. Based on the coded rules, facial features in an input image are extracted first, and face candidates are identified. Works well for face localization in uncluttered backgrounds.
Template-Based Methods: Summary
Pros:
Simple
Cons:
Templates need to be initialized near the face images. It is difficult to enumerate templates for different poses (similar to knowledge-based methods).
Knowledge-Based Methods
Top-down approach: represent a face using a set of human-coded rules. Example:
The center part of the face has uniform intensity values. The difference between the average intensity values of the center part and the upper part is significant. A face often appears with two eyes that are symmetric to each other, a nose and a mouth.
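A minimal sketch of one such coded rule is given below: it checks that the centre of a candidate window is fairly uniform and differs clearly from the upper band. The way the window is split into regions and the two thresholds are invented for illustration; they are not rules quoted from the slides.

```python
# Hedged sketch of a single human-coded face rule of the kind listed above.
# Region boundaries and thresholds are illustrative assumptions.
import numpy as np

def passes_center_rule(window, max_center_std=25.0, min_diff=15.0):
    """window: 2-D grayscale array for one face candidate."""
    h, w = window.shape
    upper = window[: h // 4, :]                              # top band (forehead/hair)
    center = window[h // 3: 2 * h // 3, w // 4: 3 * w // 4]  # central block
    uniform_enough = center.std() < max_center_std
    contrast_enough = abs(center.mean() - upper.mean()) > min_diff
    return uniform_enough and contrast_enough
```

A full knowledge-based detector would combine many such rules and scan them over candidate windows at several scales.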
Image Alignment and Stitching: A Tutorial

Richard Szeliski. Last updated December 10, 2006. Technical Report MSR-TR-2004-92.
This tutorial reviews image alignment and image stitching algorithms. Image alignment algorithms can discover the correspondence relationships among images with varying degrees of overlap. They are ideally suited for applications such as video stabilization, summarization, and the creation of panoramic mosaics. Image stitching algorithms take the alignment estimates produced by such registration algorithms and blend the images in a seamless manner, taking care to deal with potential problems such as blurring or ghosting caused by parallax and scene movement as well as varying image exposures. This tutorial reviews the basic motion models underlying alignment and stitching algorithms, describes effective direct (pixel-based) and feature-based alignment algorithms, and describes blending algorithms used to produce seamless mosaics. It closes with a discussion of open research problems in the area.
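The abstract mentions blending the aligned images so that seams, ghosting and exposure differences are not visible. As a hedged illustration of the simplest member of that family, the sketch below does linear feathering across a horizontal overlap of two images that are assumed to already be aligned on the same canvas and of equal size; real stitchers surveyed in the tutorial use more careful seam selection and exposure compensation.

```python
# Hedged sketch of a simple blending step: linear feathering across a
# horizontal overlap of two already-aligned, equally sized images.
import numpy as np

def feather_blend(left, right, overlap):
    """left/right: HxWx3 arrays on a shared canvas; overlap: width in pixels
    of the region where both images are valid (assumed at the right edge)."""
    h, w = left.shape[:2]
    out = left.astype(np.float64).copy()
    x0 = w - overlap
    alpha = np.linspace(1.0, 0.0, overlap)[None, :, None]   # weight 1 -> 0 across overlap
    out[:, x0:] = alpha * left[:, x0:].astype(np.float64) \
                  + (1.0 - alpha) * right[:, x0:].astype(np.float64)
    return out
```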
Research on Precise Mosaicking of Multi-Strip Side-Scan Sonar Images

Technology Innovation and Application, No. 5, 2021 (Innovation Frontier). Research on precise mosaicking of multi-strip side-scan sonar images*. Gao Fei 1, Wang Xiao 2,*, Yang Jinghua 2, Zhang Boyu 2, Zhou Haibo 2, Chen Jiaxing 2 (1. Qingdao Institute of Surveying and Mapping, Qingdao, Shandong 266032, China; 2. School of Marine Technology and Geomatics, Jiangsu Ocean University, Lianyungang, Jiangsu 222005, China).

Introduction: As land resources become increasingly depleted, countries around the world have shifted the focus of resource development and utilization to the ocean, and China has formulated a "maritime power" strategy to this end.
Survey activities targeting the ocean are steadily increasing, and understanding the structure of the seabed surface and shallow subsurface is of great significance for marine scientific research and ocean engineering construction [1].
Side-scan sonar (SSS), as a technique for rapidly acquiring high-resolution images of the seabed, is widely used in ocean engineering construction, seabed resource development, and target detection and recognition [2-7].
Because side-scan sonar is towed and is affected by tides, waves and other environmental factors, the towfish position derived from the survey vessel's GNSS coordinates is biased and the geocoded image positions are inaccurate; as a result, mosaics assembled from geocoded strips show misaligned targets across adjacent strips. Commonly used foreign processing packages such as Isis (Triton), SIPS (Caris) and SonarWeb (Chesapeake) all provide geocoded mosaicking functions [8,9], but they cannot produce a fine, seamless "single map" of the seabed topography.
To address this problem, Zhao et al. [10,11] proposed a SURF-feature-based mosaicking method for adjacent SSS strip images; to reduce the cost of SURF feature matching they adopted an image segmentation strategy based on track coordinates, which improved efficiency to some extent. Wang Aixue et al. [12], taking the local distortion of targets into account, proposed an elastic matching strategy that preserves the shape of seabed targets visible in both strips. Guo Jun [13], Ni Xianfeng [14], Hou Xue [15], Wu Meng [16], Pan Jianping [17] and others have also studied SURF-feature-based mosaicking of SSS images; however, the SURF matching time of all of the above methods cannot meet the real-time processing requirements of large-area image mosaicking.
Moreover, in traditional feature-based mosaicking, one image is held fixed while the remaining strips are rotated and transformed, so the georeferencing of the outermost strips is lost after mosaicking; and if some strips cannot be matched by features, geocoding and feature matching cannot be combined to obtain a fine "single map" of the seabed topography.
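A hedged sketch of the between-strip correction idea discussed above is given below: features are matched only inside an overlapping segment of two adjacent, already geocoded strips (the segment being chosen from track coordinates), and a rigid offset is estimated to realign the second strip. ORB is used here as a freely available stand-in for the SURF features used in the cited work, and the segment bounds and parameters are illustrative assumptions rather than values from the paper.

```python
# Hedged sketch of segment-wise registration of adjacent side-scan sonar strips.
# ORB replaces SURF for availability; ROIs and thresholds are illustrative.
import cv2
import numpy as np

def strip_offset(strip_a, strip_b, roi_a, roi_b):
    """roi_a/roi_b: (x, y, w, h) overlapping segments chosen from track coordinates."""
    xa, ya, wa, ha = roi_a
    xb, yb, wb, hb = roi_b
    seg_a = strip_a[ya:ya + ha, xa:xa + wa]
    seg_b = strip_b[yb:yb + hb, xb:xb + wb]

    orb = cv2.ORB_create(nfeatures=2000)
    ka, da = orb.detectAndCompute(seg_a, None)
    kb, db = orb.detectAndCompute(seg_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(da, db), key=lambda m: m.distance)[:200]

    src = np.float32([ka[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kb[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # partial affine (rotation + scale + translation) with RANSAC outlier rejection
    M, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC,
                                             ransacReprojThreshold=3.0)
    return M, inliers
```

Restricting the matching to track-based segments is what keeps the feature-matching cost bounded for long survey lines.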
Illumination-robust pattern matching using distorted color histograms

Pattern Matching Using Distorted Color Histograms

Georg Thimm and Juergen Luettin, Institut Dalle Molle d'Intelligence Artificielle Perceptive (IDIAP), C.P. 592, CH-1920 Martigny, Switzerland. Email: Thimm, Luettin@idiap.ch

The appearance of objects is often subject to illumination variations, which impedes the recognition of these objects. Appearance changes caused by illumination variation can roughly be classified into partial shadowing (including shadows caused by the object itself), occlusion, specular reflection, total shadowing, and global illumination changes (i.e. the average grey-level of the scene, respectively of the whole object, is changing). Occlusion, partial shadowing, and specular reflection cause the most difficulties in the context of computer vision. Highly sophisticated approaches use, for example, an approximate 3-dimensional representation of the scene and the position of the light source(s) [6][8], a combined PCA model of shape and intensity on landmark points [4], respectively active shapes [5][7], a model for the object under multiple illumination situations (Eigenfaces [13]), a direct model of the illumination variation and specular reflections [2], or 3-dimensional models and neural networks to estimate the position of the light sources [3]. Global illumination changes and total shadowing, however, are not well modeled by these approaches. To the knowledge of the authors, global illumination changes were only considered in combination with other image analysis methods, for example in the context of change detection (see [12] for more references) or optical flow [10]. We assume that it is inefficient to model directly and only the appearance change of an object. A better approach will model the illumination that is global to the object separately from other appearance changes.

The example in figure 1 illustrates a possible situation considered in this publication. Suppose that faces have to be recognized in an outdoor scene, neglecting appearance changes due to orientation and partial shadowing. Depending on the time of day, the illumination of the scene changes and, at the same time, the relative brightness of objects.
Consequently, a normalized grey-level histogram of the scene is also subject to alterations. For example, the grey-level histograms of the scene in the late and early evening are different. The artificial light sources correspond as before to the brightest parts of the histogram, but the middle of the histogram is "emptied" at the cost of the darker parts. At the same time, the face appears to be darker as compared to the image taken in the early evening. Looking at the distribution of grey-levels, this means that the face contributes to the score of lower intensities. This is symbolized by the dashed boxes in figure 1. In other words, the grey-level histogram is non-linearly projected onto another, or distorted.

Figure 1: illumination changes distort the grey-level histogram (normalized grey-level histograms of the same scene in the early and late evening).

In order to compensate for the distortion of the grey-level histograms, the histogram of the template has to be modified prior to matching with some image location. The function mapping the histogram of the template to the histogram of the image models the illumination variation. Therefore the shape of this function is constrained according to three assumptions:

1. As the image is normalized, the lowest and highest intensities in the grey-level histogram will be mapped onto themselves.
2. Contrasts diminish or augment comparably when the global illumination changes. Therefore, modifications of grey-levels must vary smoothly within neighboring intensity values.
3. The relative brightness of arbitrary objects must remain unchanged: if a certain spot in the image is brighter than another spot, it will remain brighter or, in the limit, assume the same intensity.

A simple pattern matching algorithm using such a histogram mapping function can be formulated in the following way: let T be a feature vector of grey values representing the template, and I(x) a vector extracted at location x of some image to be compared with T. Then the most likely position x* for the object represented by the template can be defined as

x* = argmin_x || f(T) - I(x) ||^2,

where the function f distorts the color or grey-level histogram. f is parameterized by a vector a corresponding to the deviation of the illumination as compared to the illumination of the template. Since a is usually unknown, it has to be included in the minimization of the error:

x* = argmin_x min_a || f_a(T) - I(x) ||^2.

As discussed earlier, f has to fulfill some conditions in order to avoid a too flexible mapping, which would result in low scores for illicit image locations.

1. The invariability of the lowest and highest intensity can be directly formulated as a condition on f. Supposing that the images from which T and I(x) were extracted are normalized, and black is coded as g_min and white as g_max, f has to fulfill f(g_min) = g_min and f(g_max) = g_max.
2. The similarity constraint on the variation of close grey-levels can be fulfilled by demanding that f possess a smooth first derivative.
3. That grey-levels are not interchangeable implies that the mapping function is non-decreasing for the range of valid grey-levels. As f possesses a first derivative: f'(g) >= 0 for g_min <= g <= g_max.

Considering these constraints, f was chosen to be a second-order polynomial. It follows from the constraints above that f has the form

f_a(g) = g + a (g - g_min)(g_max - g),

with the free variable a restricted to the interval |a| <= 1/(g_max - g_min). This function has the property that, depending on the sign of a, either the contrasts in the brighter parts, respectively the darker parts, of the image are augmented. At the same time, the contrasts in the darker parts, respectively the brighter parts, are lowered. The chosen form of f has the advantage that an explicit solution for a exists.
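The explicit formula in the extracted text was garbled, so the polynomial above is a reconstruction from the stated constraints. The sketch below implements that reconstructed mapping and the per-location minimisation; the coarse grid search over a stands in for the closed-form solution mentioned in the paper, and the symbols and parameter ranges are assumptions consistent with the constraints, not the authors' verbatim notation.

```python
# Hedged sketch of matching with a distorted grey-level histogram, using the
# reconstructed mapping f_a(g) = g + a*(g - g_min)*(g_max - g) with
# |a| <= 1/(g_max - g_min). Grid search replaces the paper's explicit solution.
import numpy as np

G_MIN, G_MAX = 0.0, 255.0

def distort(patch, a):
    """Apply the histogram-distortion map to a grey-level patch."""
    return patch + a * (patch - G_MIN) * (G_MAX - patch)

def match_score(template, patch):
    """Best (lowest) squared error over the admissible illumination parameters a."""
    a_max = 1.0 / (G_MAX - G_MIN)
    best = np.inf
    for a in np.linspace(-a_max, a_max, 21):
        err = np.sum((distort(template, a) - patch) ** 2)
        best = min(best, err)
    return best

def locate(template, image, step=2):
    """Scan the image and return the top-left corner with the lowest score."""
    th, tw = template.shape
    t = template.astype(np.float64)
    best_pos, best_err = None, np.inf
    for y in range(0, image.shape[0] - th, step):
        for x in range(0, image.shape[1] - tw, step):
            err = match_score(t, image[y:y + th, x:x + tw].astype(np.float64))
            if err < best_err:
                best_pos, best_err = (x, y), err
    return best_pos
```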
The proposed method was tested on 4,000 X-ray images of the vocal tract of talking persons [9]. In these tests, fillings in the upper and lower teeth, as well as the tips of the front teeth, were tracked. The results are compared in experiments with a standard pattern matching algorithm (that is, equivalent to using the identity mapping f(g) = g) and the Eigenface method using different numbers of eigenvectors. The results showed that the proposed method performed better than each of the other approaches.

In the way the distorted pattern matching approach is described above, it cannot sensibly be applied when appearance changes are evoked by global illumination changes and other incidents. This deficiency may in principle be overcome by combining it with other techniques, for example PCA models of the grey-level appearance [5]. For this method of object modelization, the most likely position can be redefined as

x* = argmin_x min_{a,c} || f_a(E c + m) - I(x) ||^2,

where an eigenvector matrix E, a mean appearance m, and an appearance vector c describe the appearance of an object under constant global illumination. First experiments were performed with automatically generated images of faces under various illuminations [1] (implemented by H. Rowley [11]). In these experiments the combined method showed some improvement for the tasks "locate the mouth" and "locate the left eye" over pattern matching with and without distorted grey-level histograms and genuine grey-level PCA modelization. Full details and statistics on the performance of the algorithms will be included in the full paper.

Conclusion

We proposed a simple to use, but still efficient, method for the modelization of global illumination using distorted grey-level histograms. A quantitative comparison in experiments with standard pattern matching and PCA modelization of the grey-level appearance shows that the proposed algorithm outperforms both. Besides this, pattern matching with distorted histograms has a complexity close to standard pattern matching. This gives a further advantage over the Eigenface algorithm, which has a higher computational complexity and is somewhat more difficult to use and implement.

References

1. P.N. Belhumeur and D.J. Kriegman. What is the set of images of an object under all possible lighting conditions? In Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR'96), pages 270-277, 1996.
2. Michael J. Black, David J. Fleet, and Yaser Yacoob. A framework for modeling appearance change in image sequences. In Proc. of the Sixth International Conference on Computer Vision (ICCV98). IEEE, January 1998.
3. R. Brunelli. Estimation of pose and illuminant direction for face processing. Image and Vision Computing, 10(15):741-748, 1997.
4. T.F. Cootes and C.J. Taylor. Modelling object appearance using the grey-level surface. In Proceedings of the 5th British Machine Vision Conference, pages 479-488, York, 1994.
5. T.F. Cootes and C.J. Taylor. Using grey-level models to improve active shape model search. In Proceedings - International Conference on Pattern Recognition, volume 1, pages 63-67. IEEE, Piscataway, NJ, USA, 1994.
6. A.S. Georghiades, D.J. Kriegman, and P.N. Belhumeur. Illumination cones for recognition under variable lighting: Faces. In IEEE Conf. on Computer Vision and Pattern Recognition, 1998.
7. A. Lanitis, C.J. Taylor, and T.F. Cootes. Recognising human faces using shape and grey-level information. In Proceedings of the 3rd International Conference on Automation, Robotics and Computer Vision, volume 2, pages 1153-1157, Singapore, 1994.
8. N. Mukawa. Estimation of shape, reflection coefficients, and illuminant direction from image sequences. In International Conference on Computer Vision (ICCV90), pages 507-512, 1990.
9. K.G. Munhall, E. Vatikiotis-Bateson, and Y. Tokhura. X-ray film database for speech research. Journal of the Acoustical Society of America, 98(2):1222-1224, 1995.
10. S. Negahdaripour and C.H. Yu. A generalized brightness change model for computing optical flow. In International Conference on Computer Vision (ICCV93), pages 2-11, 1993.
11. Henry Rowley. WWW Home page, URL: /afs//user/har/Web/home.html, 1998.
12. K.D. Skifstad and R.C. Jain. Illumination independent change detection for real world image sequences. Computer Vision Graphics and Image Processing (CVGIP), 46(3):387-399, June 1989.
13. M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-96, 1991.
Vision-Based Ground Target Tracking for a Rotor UAV
The SIFT algorithm is used to recognize the ground target in this paper. The SIFT algorithm, first proposed by David G. Lowe in 1999 [5] and improved in 2004 [6], is a hot field of feature matching at present; its effectiveness is invariant to image rotation, scale zoom and brightness transformations, and it also maintains a certain degree of stability under perspective and affine transformations. SIFT feature points are scale-invariant local points of an image, with the characteristics of good uniqueness, rich information, large numbers, high speed, scalability, and so on. A. SIFT Algorithm. The SIFT algorithm consists of four parts. The process of SIFT feature construction is shown in Fig. 1.
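A minimal sketch of SIFT-based target recognition of the kind outlined above is given below: detect SIFT keypoints in a target template and in the current camera frame, then keep matches that pass Lowe's ratio test. The ratio threshold and matcher choice are common defaults, not values taken from this paper.

```python
# Hedged sketch of SIFT keypoint matching between a target template and the
# current frame. Requires OpenCV >= 4.4, where SIFT_create is available.
import cv2

def sift_match(template_gray, frame_gray, ratio=0.75):
    sift = cv2.SIFT_create()
    kp_t, des_t = sift.detectAndCompute(template_gray, None)
    kp_f, des_f = sift.detectAndCompute(frame_gray, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_t, des_f, k=2)        # two nearest neighbours each
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    # matched template/frame point pairs, usable for motion or pose estimation
    return [(kp_t[m.queryIdx].pt, kp_f[m.trainIdx].pt) for m in good]
```

The matched point pairs would then feed the target state estimation and PTZ control loop described in the introduction.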
I. INTRODUCTION. The UAV is one of the best platforms to perform dull, dirty or dangerous (3D) tasks [1]. UAVs can be used in various applications where human intervention is impossible, which greatly expands the application space of visual tracking. Research on vision-based ground target tracking technology for UAVs has been of great concern to cybernetics and robotics experts, and has become one of the most active research directions in UAV applications. Currently, researchers from America, Britain, France and Sweden are on the cutting edge in this field [2]. Typical visual tracking platforms for UAVs include Scan Eagle, GTMax, RQ-11, RQ-16, DragonFly, etc. Because of many advantages, such as small size, light weight, flexibility, ease of transport and low cost, rotor UAVs have broad application prospects in the fields of traffic monitoring, resource exploration, electricity patrol, forest fire prevention, aerial photography, atmospheric monitoring, etc. [3]. A vision-based ground target tracking system for a rotor UAV is a system that acquires images with a camera installed on a low-flying rotor UAV, recognizes the target in the images and estimates the motion state of the target, and finally, according to the visual information, regulates the pan-tilt-zoom (PTZ) camera automatically to keep the target at the center of the camera view. In view of the current state of international research, the study of ground target tracking systems for
Image Stitching Technology
Introduction to image stitching: Image stitching is a method of combining multiple overlapping images of the same scene into a larger image; it is of great significance in medical imaging, computer vision, satellite data, automatic recognition of military targets and other fields.
The output of image stitching is the union of the two input images.
Image stitching means seamlessly joining two images that share a common captured region.
Such applications include dynamic monitoring at stations, pedestrian-flow detection in shopping malls, traffic monitoring at intersections and so on; a single panoramic image replaces today's walls of monitors or per-region video displays and reduces the strain on operators' eyes.
Basic idea: Image stitching is not simply overlaying the common region of two images. Because the two images are captured from different angles and positions, the camera's intrinsic and extrinsic parameters differ between the two shots even though a common region exists, so simple overlay stitching is not reasonable.
Therefore, stitching requires taking one image as the reference and applying a corresponding transformation (a perspective transform) to the other image; the transformed image is then simply translated so that its common region coincides with that of the reference image.
Notes: 1. Image preprocessing is intended to enhance image features; preprocessing can include grayscale conversion, denoising, distortion correction, etc.
2. Available feature point extraction methods include SIFT, SURF, FAST, Harris, etc. SIFT is invariant to rotation and scaling, and SURF is an accelerated version of SIFT; both detect well. SIFT is used for the implementation here.
3. When computing the homography matrix, be clear about the mapping direction: whether it maps from the first image's space to the second image's space, or from the second image to the first. This matters a great deal during the transformation.
4. Determining which image is the left (or top) one and which is the right (or bottom) one clarifies the stitching relationship; it is recommended to make this determination before computing the homography, so that the mapping direction does not get reversed.
Otherwise, half of the stitched image may end up empty.
Five steps are usually involved. Feature extraction: detect feature points in all input images. Image registration: establish geometric correspondences between the images so that they can be transformed, compared and analyzed in a common reference frame.
Registration methods can be roughly divided into the following categories: 1. algorithms that use image pixel values directly, e.g. correlation methods; 2. algorithms that work in the frequency domain, e.g. FFT-based methods; 3. algorithms that use low-level features, typically edges and corners, e.g. feature-based methods; 4. algorithms that use high-level features, typically overlapping image regions and feature relations, e.g. graph-theoretic methods. Image warping: warping reprojects one of the images and places it on a larger canvas.
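A minimal sketch of the flow described above (SIFT features, a RANSAC-estimated homography mapping image 2 into image 1's space, then warping onto a wider canvas) is shown below. The canvas sizing and the simple overwrite in place of real blending are deliberate simplifications, and the parameter values are common defaults rather than values from this article.

```python
# Hedged sketch of two-image stitching: SIFT matching, homography from image 2
# into image 1's space (mind the mapping direction, as note 3 warns), warp and
# overlay. Blending and canvas cropping are intentionally omitted.
import cv2
import numpy as np

def stitch_pair(img1, img2):
    g1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(g1, None)
    k2, d2 = sift.detectAndCompute(g2, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(d2, d1, k=2)
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

    # homography from image 2's coordinates into image 1's coordinates
    src = np.float32([k2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    h1, w1 = img1.shape[:2]
    canvas = cv2.warpPerspective(img2, H, (w1 + img2.shape[1], h1))
    canvas[:h1, :w1] = img1            # reference image kept unchanged on the left
    return canvas
```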
Speeded-Up Robust Features (SURF)
Speeded-Up Robust Features (SURF)

Herbert Bay a, Andreas Ess a,*, Tinne Tuytelaars b, Luc Van Gool a,b
a ETH Zurich, BIWI, Sternwartstrasse 7, CH-8092 Zurich, Switzerland
b K.U. Leuven, ESAT-PSI, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
Received 31 October 2006; accepted 5 September 2007. Available online 15 December 2007.

Abstract

This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important parameters. We conclude the article with SURF's application to two challenging, yet converse goals: camera calibration as a special case of image registration, and object recognition. Our experiments underline SURF's usefulness in a broad range of topics in computer vision. © 2007 Elsevier Inc. All rights reserved.

Keywords: Interest points; Local features; Feature description; Camera calibration; Object recognition

1. Introduction

The task of finding point correspondences between two images of the same scene or object is part of many computer vision applications. Image registration, camera calibration, object recognition, and image retrieval are just a few. The search for discrete image point correspondences can be divided into three main steps. First, 'interest points' are selected at distinctive locations in the image, such as corners, blobs, and T-junctions. The most valuable property of an interest point detector is its repeatability. The repeatability expresses the reliability of a detector for finding the same physical interest points under different viewing conditions. Next, the neighbourhood of every interest point is represented by a feature vector. This descriptor has to be distinctive and at the same time robust to noise, detection displacements and geometric and photometric deformations. Finally, the descriptor vectors are matched between different images. The matching is based on a distance between the vectors, e.g. the Mahalanobis or Euclidean distance. The dimension of the descriptor has a direct impact on the time this takes, and less dimensions are desirable for fast interest point matching. However, lower dimensional feature vectors are in general less distinctive than their high-dimensional counterparts.

It has been our goal to develop both a detector and descriptor that, in comparison to the state-of-the-art, are fast to compute while not sacrificing performance. In order to succeed, one has to strike a balance between the above requirements, like simplifying the detection scheme while keeping it accurate, and reducing the descriptor's size while keeping it sufficiently distinctive. A wide variety of detectors and descriptors have already been proposed in the literature (e.g. [21,24,27,39,25]). Also, detailed comparisons and evaluations on benchmarking datasets have been performed [28,30,31]. Our fast detector and descriptor, called SURF (Speeded-Up Robust Features), was introduced in [4].
It is built on the insights gained from this previous work. In our experiments on these benchmarking datasets, SURF's detector and descriptor are not only faster, but the former is also more repeatable and the latter more distinctive.

We focus on scale and in-plane rotation-invariant detectors and descriptors. These seem to offer a good compromise between feature complexity and robustness to commonly occurring photometric deformations. Skew, anisotropic scaling, and perspective effects are assumed to be second order effects, that are covered to some degree by the overall robustness of the descriptor. Note that the descriptor can be extended towards affine-invariant regions using affine normalisation of the ellipse (cf. [31]), although this will have an impact on the computation time. Extending the detector, on the other hand, is less straightforward. Concerning the photometric deformations, we assume a simple linear model with a bias (offset) and contrast change (scale factor). Neither detector nor descriptor use colour information.

The article is structured as follows. In Section 2, we give a review over previous work in interest point detection and description. In Section 3, we describe the strategy applied for fast and robust interest point detection. The input image is analysed at different scales in order to guarantee invariance to scale changes. The detected interest points are provided with a rotation and scale-invariant descriptor in Section 4. Furthermore, a simple and efficient first-line indexing technique, based on the contrast of the interest point with its surrounding, is proposed. In Section 5, some of the available parameters and their effects are discussed, including the benefits of an upright version (not invariant to image rotation). We also investigate SURF's performance in two important application scenarios. First, we consider a special case of image registration, namely the problem of camera calibration for 3D reconstruction. Second, we will explore SURF's application to an object recognition experiment. Both applications highlight SURF's benefits in terms of speed and robustness as opposed to other strategies. The article is concluded in Section 6.

2. Related work

2.1. Interest point detection

The most widely used detector is probably the Harris corner detector [15], proposed back in 1988. It is based on the eigenvalues of the second moment matrix. However, Harris corners are not scale invariant. Lindeberg [21] introduced the concept of automatic scale selection. This allows to detect interest points in an image, each with their own characteristic scale. He experimented with both the determinant of the Hessian matrix as well as the Laplacian (which corresponds to the trace of the Hessian matrix) to detect blob-like structures. Mikolajczyk and Schmid [26] refined this method, creating robust and scale-invariant feature detectors with high repeatability, which they coined Harris-Laplace and Hessian-Laplace. They used a (scale-adapted) Harris measure or the determinant of the Hessian matrix to select the location, and the Laplacian to select the scale. Focusing on speed, Lowe [23] proposed to approximate the Laplacian of Gaussians (LoG) by a Difference of Gaussians (DoG) filter.
Several other scale-invariant interest point detectors have been proposed. Examples are the salient region detector, proposed by Kadir and Brady [17], which maximises the entropy within the region, and the edge-based region detector proposed by Jurie and Schmid [16]. They seem less amenable to acceleration though. Also several affine-invariant feature detectors have been proposed that can cope with wider viewpoint changes. However, these fall outside the scope of this article.

From studying the existing detectors and from published comparisons [29,30], we can conclude that Hessian-based detectors are more stable and repeatable than their Harris-based counterparts. Moreover, using the determinant of the Hessian matrix rather than its trace (the Laplacian) seems advantageous, as it fires less on elongated, ill-localised structures. We also observed that approximations like the DoG can bring speed at a low cost in terms of lost accuracy.

2.2. Interest point description

An even larger variety of feature descriptors has been proposed, like Gaussian derivatives [11], moment invariants [32], complex features [1], steerable filters [12], phase-based local features [6], and descriptors representing the distribution of smaller-scale features within the interest point neighbourhood. The latter, introduced by Lowe [24], have been shown to outperform the others [28]. This can be explained by the fact that they capture a substantial amount of information about the spatial intensity patterns, while at the same time being robust to small deformations or localisation errors. The descriptor in [24], called SIFT for short, computes a histogram of local oriented gradients around the interest point and stores the bins in a 128D vector (8 orientation bins for each of 4×4 location bins).

Various refinements on this basic scheme have been proposed. Ke and Sukthankar [18] applied PCA on the gradient image around the detected interest point. This PCA-SIFT yields a 36D descriptor which is fast for matching, but proved to be less distinctive than SIFT in a second comparative study by Mikolajczyk and Schmid [30]; and applying PCA slows down feature computation. In the same paper [30], the authors proposed a variant of SIFT, called GLOH, which proved to be even more distinctive with the same number of dimensions. However, GLOH is computationally more expensive as it uses again PCA for data compression. The SIFT descriptor still seems the most appealing descriptor for practical uses, and hence also the most widely used nowadays. It is distinctive and relatively fast, which is crucial for on-line applications.
For on-line applications relying only on a regular PC,each one of the three steps(detection,description,matching)has to be fast.An entire body of work is available on speeding up the matching step.All of them come at the expense of getting an approximative matching.Methods include the best-bin-first proposed by Lowe[24],balltrees[35],vocabulary trees[34],locality sensitive hashing[9],or redundant bit vectors[13].Complementary to this,we suggest the use of the Hessian matrix’s trace to significantly increase the matching speed.Together with the descriptor’s low dimen-sionality,any matching algorithm is bound to perform faster.3.Interest point detectionOur approach for interest point detection uses a very basic Hessian matrix approximation.This lends itself to the use of integral images as made popular by Viola and Jones[41],which reduces the computation time drastically. Integral imagesfit in the more general framework of box-lets,as proposed by Simard et al.[38].3.1.Integral imagesIn order to make the article more self-contained,we briefly discuss the concept of integral images.They allow for fast computation of box type convolutionfilters.The entry of an integral image I RðxÞat a location x¼ðx;yÞT represents the sum of all pixels in the input image I within a rectangular region formed by the origin and x.I RðxÞ¼X i6xi¼0X j6yj¼0Iði;jÞð1ÞOnce the integral image has been computed,it takes three additions to calculate the sum of the intensities over any upright,rectangular area(see Fig.1).Hence,the calcu-lation time is independent of its size.This is important in our approach,as we use bigfilter sizes.3.2.Hessian matrix-based interest pointsWe base our detector on the Hessian matrix because of its good performance in accuracy.More precisely,we detect blob-like structures at locations where the determi-nant is maximum.In contrast to the Hessian-Laplace detector by Mikolajczyk and Schmid[26],we rely on the determinant of the Hessian also for the scale selection,as done by Lindeberg[21].Given a point x¼ðx;yÞin an image I,the Hessian matrix Hðx;rÞin x at scale r is defined as followsHðx;rÞ¼L xxðx;rÞL xyðx;rÞL xyðx;rÞL yyðx;rÞ;ð2Þwhere L xxðx;rÞis the convolution of the Gaussian secondorder derivative o22gðrÞwith the image I in point x,and similarly for L xyðx;rÞand L yyðx;rÞ.Gaussians are optimal for scale-space analysis[19,20], but in practice they have to be discretised and cropped (Fig.2,left half).This leads to a loss in repeatability under image rotations around odd multiples of p.This weakness holds for Hessian-based detectors in general. Fig.3shows the repeatability rate of two detectors based on the Hessian matrix for pure image rotation. The repeatability attains a maximum around multiples of p2.This is due to the square shape of thefilter.Nev-ertheless,the detectors still perform well,and the slight decrease in performance does not outweigh the advan-tage of fast convolutions brought by the discretisation and cropping.As realfilters are non-ideal in any case, and given Lowe’s success with his LoG approximations, we push the approximation for the Hessian matrix even further with boxfilters(in the right half of Fig.2). 
These approximate second order Gaussian derivatives and can be evaluated at a very low computationalcost ing integral images,it takes only three additions and four memory accesses to calculate the sum of intensities inside a rectangular region of anysize.Fig.2.Left to right:The(discretised and cropped)Gaussian second order partial derivative in y-(L yy)and xy-direction(L xy),respectively;our approximation for the second order Gaussian partial derivative in y-(D yy) and xy-direction(D xy).The grey regions are equal to zero.348H.Bay et al./Computer Vision and Image Understanding110(2008)346–359using integral images.The calculation time therefore is independent of thefilter size.As shown in Section5 and Fig.3,the performance is comparable or better than with the discretised and cropped Gaussians.The9Â9boxfilters in Fig.2are approximations of a Gaussian with r¼1:2and represent the lowest scale(i.e. highest spatial resolution)for computing the blob response maps.We will denote them by D xx,D yy,and D xy.The weights applied to the rectangular regions are kept simple for computational efficiency.This yieldsdetðH approxÞ¼D xx D yyÀðwD xyÞ2:ð3ÞThe relative weight w of thefilter responses is used to bal-ance the expression for the Hessian’s determinant.This is needed for the energy conservation between the Gaussian kernels and the approximated Gaussian kernels,w¼j L xyð1:2ÞjFj D yyð9ÞjFj L yyð1:2ÞjFj D xyð9ÞjF¼0:912:::’0:9;ð4Þwhere j x jF is the Frobenius norm.Notice that for theoret-ical correctness,the weighting changes depending on the scale.In practice,we keep this factor constant,as this did not have a significant impact on the results in our experiments.Furthermore,thefilter responses are normalised with respect to their size.This guarantees a constant Frobenius norm for anyfilter size,an important aspect for the scale space analysis as discussed in the next section.The approximated determinant of the Hessian repre-sents the blob response in the image at location x.These responses are stored in a blob response map over different scales,and local maxima are detected as explained in Sec-tion3.4.3.3.Scale space representationInterest points need to be found at different scales,not least because the search of correspondences often requires their comparison in images where they are seen at different scales.Scale spaces are usually implemented as an image pyramid.The images are repeatedly smoothed with a Gaussian and then sub-sampled in order to achieve a higher level of the pyramid.Lowe[24]subtracts these pyr-amid layers in order to get the DoG(Difference of Gaussi-ans)images where edges and blobs can be found.Due to the use of boxfilters and integral images,we do not have to iteratively apply the samefilter to the output of a previouslyfiltered layer,but instead can apply boxfilters of any size at exactly the same speed directly on the original image and even in parallel(although the latter is not exploited here).Therefore,the scale space is analysed by up-scaling thefilter size rather than iteratively reducing the image size,Fig.4.The output of the9Â9filter,intro-duced in previous section,is considered as the initial scale layer,to which we will refer as scale s¼1:2(approximating Gaussian derivatives with r¼1:2).The following layers are obtained byfiltering the image with gradually bigger masks,taking into account the discrete nature of integral images and the specific structure of ourfilters.Note that our main motivation for this type of sampling is its computational efficiency.Furthermore,as we 
do not have to downsample the image,there is no aliasing.On the downside,boxfilters preserve high-frequency compo-nents that can get lost in zoomed-out variants of the same scene,which can limit scale-invariance.This was however not noticeable in our experiments.The scale space is divided into octaves.An octave repre-sents a series offilter response maps obtained by convolv-ing the same input image with afilter of increasing size.In total,an octave encompasses a scaling factor of2(which implies that one needs to more than double thefilter size, see below).Each octave is subdivided into a constant num-ber of scale levels.Due to the discrete nature of integral images,the minimum scale difference between two subse-quent scales depends on the length l0of the positive or neg-ative lobes of the partial second order derivative in the direction of derivation(x or y),which is set to a third of thefilter size length.For the9Â9filter,this length l0is 3.For two successive levels,we must increase this size byFig.3.Top:Repeatability score for image rotation of up to180°.Hessian-based detectors have in general a lower repeatability score for anglesFig.4.Instead of iteratively reducing the image size(left),the use ofintegral images allows the up-scaling of thefilter at constant cost(right).H.Bay et al./Computer Vision and Image Understanding110(2008)346–359349a minimum of 2pixels (1pixel on every side)in order to keep the size uneven and thus ensure the presence of the central pixel.This results in a total increase of the mask size by 6pixels (see Fig.5).Note that for dimensions different from l 0(e.g.the width of the central band for the vertical filter in Fig.5),rescaling the mask introduces rounding-offerrors.However,since these errors are typically much smaller than l 0,this is an acceptable approximation.The construction of the scale space starts with the 9Â9filter,which calculates the blob response of the image for the smallest scale.Then,filters with sizes 15Â15,21Â21,and 27Â27are applied,by which even more than a scale change of two has been achieved.But this is needed,as a 3D non-maximum suppression is applied both spa-tially and over the neighbouring scales.Hence,the first and last Hessian response maps in the stack cannot contain such maxima themselves,as they are used for reasons of comparison only.Therefore,after interpolation,see Sec-tion 3.4,the smallest possible scale is r ¼1:6¼1:2129corre-sponding to a filter size of 12Â12,and the highest to r ¼3:2¼1:224.For more details,we refer to [2].Similar considerations hold for the other octaves.For each new octave,the filter size increase is doubled (going from 6–12to 24–48).At the same time,the sampling inter-vals for the extraction of the interest points can be doubled as well for every new octave.This reduces the computation time and the loss in accuracy is comparable to the image sub-sampling of the traditional approaches.The filter sizes for the second octave are 15,27,39,51.A third octave is com-puted with the filter sizes 27,51,75,99and,if the original image size is still larger than the corresponding filter sizes,the scale space analysis is performed for a fourth octave,using the filter sizes 51,99,147,and 195.Fig.6gives an over-view of the filter sizes for the first three octaves.Further octaves can be computed in a similar way.In typical scale-space analysis however,the number of detected interest points per octave decays very quickly,cf.Fig.7.The large scale changes,especially between the first fil-ters within these octaves (from 9to 
15is a change of 1.7),renders the sampling of scales quite crude.Therefore,we have also implemented a scale space with a finer sam-pling of the scales.This computes the integral image on the image up-scaled by a factor of 2,and then starts the first octave by filtering with a filter of size 15.Additional filter sizes are 21,27,33,and 39.Then a second octave starts,again using filters which now increase their sizes by 12pixels,after which a third and fourth octave follow.Now the scale change between the first two filters is only 1.4(21/15).The lowest scale for the accurate version that can be detected through quadratic interpolation is s ¼ð1:2189Þ=2¼1:2.As the Frobenius norm remains constant for our filters at any size,they are already scale normalised,and no fur-ther weighting of the filter response is required,for more information on that topic,see [22].3.4.Interest point localisationIn order to localise interest points in the image and over scales,a non-maximum suppression in a 3Â3Â3neigh-bourhood is applied.Specifically,we use a fast variant introduced by Neubeck and Van Gool [33].The maxima of the determinant of the Hessian matrix are then interpo-lated in scale and image space with the method proposed by Brown and Lowe [5].Scale space interpolation is especially important in our case,as the difference in scale between the first layers of every octave is relatively large.Fig.8shows an example of the detected interest points using our ‘Fast-Hessian’detector.4.Interest point description and matchingOur descriptor describes the distribution of the intensity content within the interest point neighbourhood,similartoFig.5.Filters D yy (top)and D xy (bottom)for two successive scale levels (9Â9and 15Â15).The length of the dark lobe can only be increased by an even number of pixels in order to guarantee the presence of a central pixel(top).Fig.6.Graphical representation of the filter side lengths for three different octaves.The logarithmic horizontal axis represents the scales.Note that the octaves are overlapping in order to cover all possible scales seamlessly.350H.Bay et al./Computer Vision and Image Understanding 110(2008)346–359the gradient information extracted by SIFT [24]and its variants.We build on the distribution of first order Haar wavelet responses in x and y direction rather than the gra-dient,exploit integral images for speed,and use only 64D.This reduces the time for feature computation and match-ing,and has proven to simultaneously increase the robust-ness.Furthermore,we present a new indexing step based on the sign of the Laplacian,which increases not only the robustness of the descriptor,but also the matching speed (by a factor of 2in the best case).We refer to our detec-tor-descriptor scheme as SURF—Speeded-Up Robust Features.The first step consists of fixing a reproducible orienta-tion based on information from a circular region around the interest point.Then,we construct a square region aligned to the selected orientation and extract the SURF descriptor from it.Finally,features are matched between two images.These three steps are explained in the following.4.1.Orientation assignmentIn order to be invariant to image rotation,we identify a reproducible orientation for the interest points.For that purpose,we first calculate the Haar wavelet responses in x and y direction within a circular neighbourhood of radius 6s around the interest point,with s the scale at which the interest point was detected.The sampling step is scale dependent and chosen to be s .In keeping with the 
rest,also the size of the wavelets are scale dependent and set to a side length of 4s .Therefore,we can again use integral images for fast filtering.The used filters are shown in Fig.9.Only six operations are needed to compute the response in x or y direction at any scale.Once the wavelet responses are calculated and weighted with a Gaussian (r ¼2s )centred at the interest point,the responses are represented as points in a space with the hor-izontal response strength along the abscissa and the vertical response strength along the ordinate.The dominant orien-tation is estimated by calculating the sum of all responses within a sliding orientation window of size p ,see Fig.10.The horizontal and vertical responses within the window are summed.The two summed responses then yield a local orientation vector.The longest such vector over all win-dows defines the orientation of the interest point.The size of the sliding window is a parameter which had to be cho-sen carefully.Small sizes fire on single dominating gradi-ents,large sizes tend to yield maxima in vector length that are not outspoken.Both result in a misorientation of the interest point.Note that for many applications,rotation invariance is not necessary.Experiments of using the upright version of SURF (U-SURF,for short)for object detection can be found in [3,4].U-SURF is faster to compute and can increase distinctivity,while maintaining a robustness to rotation of about ±15°.4.2.Descriptor based on sum of Haar wavelet responses For the extraction of the descriptor,the first step con-sists of constructing a square region centred around the interest point and oriented along the orientation selected in previous section.The size of this window is 20s .Exam-ples of such square regions are illustrated in Fig.11.The region is split up regularly into smaller 4Â4square sub-regions.This preserves important spatial information.For each sub-region,we compute Haar waveletresponsesFig.8.Detected interest points for a Sunflower field.This kind of scenes shows the nature of the features obtained using Hessian-baseddetectors.Fig.9.Haar wavelet filters to compute the responses in x (left)and y direction (right).The dark parts have the weight À1and the light parts þ1.H.Bay et al./Computer Vision and Image Understanding 110(2008)346–359351at 5Â5regularly spaced sample points.For reasons of simplicity,we call d x the Haar wavelet response in horizon-tal direction and d y the Haar wavelet response in vertical direction (filter size 2s ),see Fig.9again.‘‘Horizontal’’and ‘‘vertical’’here is defined in relation to the selected interest point orientation (see Fig.12).1To increase the robustness towards geometric deformations and localisa-tion errors,the responses d x and d y are first weighted with a Gaussian (r ¼3:3s )centred at the interest point.Then,the wavelet responses d x and d y are summed up over each sub-region and form a first set of entries in thefeature vector.In order to bring in information about the polarity of the intensity changes,we also extract the sum of the absolute values of the responses,j d x j and j d y j .Hence,each sub-region has a 4D descriptor vector v for its underlying intensity structure v ¼ðP d x ;P d y ;Pj d x j ;P j d y jÞ.Concatenating this for all 4Â4sub-regions,this results in a descriptor vector of length 64.The wavelet responses are invariant to a bias in illumina-tion (offset).Invariance to contrast (a scale factor)is achieved by turning the descriptor into a unit vector.Fig.13shows the properties of the descriptor for 
three distinctively different image-intensity patterns within a sub-region.One can imagine combinations of such local intensity patterns,resulting in a distinctive descriptor.SURF is,up to some point,similar in concept as SIFT,in that they both focus on the spatial distribution of gradi-ent information.Nevertheless,SURF outperforms SIFT in practically all cases,as shown in Section 5.We believe this is due to the fact that SURF integrates the gradient infor-mation within a subpatch,whereas SIFT depends on the orientations of the individual gradients.This makesSURFFig.10.Orientation assignment:a sliding orientation window of size p3detects the dominant orientation of the Gaussian weighted Haar wavelet responses at every sample pointwithin a circular neighbourhood around the interest point.Fig.11.Detail of the Graffiti scene showing thesize of the oriented descriptor window at different scales.Fig.12.To build the descriptor,an oriented quadratic grid with 4Â4square sub-regions is laid over the interest point (left).For each square,the wavelet responses are computed from 5Â5samples (for illustrative purposes,we show only 2Â2sub-divisions here).For each field,we collect the sums d x ,j d x j ;d y ,and j d y j ,computed relatively to the orientation of the grid (right).1For efficiency reasons,the Haar wavelets are calculated in the unrotated image and the responses arethen interpolated,instead of actually rotating the image.Fig.13.The descriptor entries of a sub-region represent the nature of the underlying intensity pattern.Left:In case of a homogeneous region,all values are relatively low.Middle:In presence of frequencies in x direction,the value of P j d x j is high,but all others remain low.Ifthe intensity is gradually increasing in x direction,both values P d x andP j d x j are high.352H.Bay et al./Computer Vision and Image Understanding 110(2008)346–359。
SLAM feature tracking methods
From a technical standpoint, SLAM feature tracking methods play a vital role in accurately estimating the robot's pose and mapping the environment. These methods typically rely on extracting and matching visual or geometric features across consecutive frames to establish correspondences and compute the robot's motion. Feature tracking algorithms should be robust to changes in lighting conditions, viewpoint variations, occlusions, and dynamic objects. Moreover, they should be able to handle large-scale environments and real-time processing requirements. Achieving these objectives is challenging due to the complexity and dynamic nature of real-world environments.

One popular approach to SLAM feature tracking is the use of feature descriptors, such as SIFT (Scale-Invariant Feature Transform) or ORB (Oriented FAST and Rotated BRIEF). These descriptors encode distinctive information about the features, allowing for reliable matching across frames. However, feature descriptors alone may not be sufficient in challenging scenarios with significant viewpoint changes or occlusions. To address this, researchers have proposed methods that combine feature descriptors with geometric constraints, such as epipolar geometry or 3D point cloud information. These methods leverage the geometric relationships between the features to improve tracking accuracy and robustness.

Another important aspect of SLAM feature tracking is the initialization of the tracking process. When a robot starts exploring a new environment, it needs to identify and track features from scratch. This initialization step is crucial for accurate motion estimation and subsequent mapping. Various methods have been proposed to address this challenge, including keypoint detection algorithms such as Harris corners or FAST (Features from Accelerated Segment Test), which aim to identify salient features in the scene. Once the initial set of features is obtained, the tracking process can be initialized and refined using feature matching and motion estimation techniques.

In recent years, deep learning-based approaches have also shown promise in SLAM feature tracking. Convolutional neural networks (CNNs) have been employed to learn feature representations directly from raw image data, eliminating the need for handcrafted descriptors. These learned features can be more robust to variations in lighting and viewpoint, potentially improving tracking performance. Additionally, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have been explored for modeling temporal dependencies in feature tracking, enabling better handling of motion blur or fast camera movements.

Despite the advancements in SLAM feature tracking methods, several challenges remain. One major challenge is the trade-off between tracking accuracy and computational efficiency. SLAM systems often operate in real-time, and the feature tracking component should be able to process frames at high frame rates while maintaining accurate estimates. This requires efficient feature detection, matching, and motion estimation algorithms. Another challenge is the robustness of feature tracking in dynamic environments. Moving objects or changes in the scene can disrupt feature correspondences and lead to tracking failures.
Developing methods that can handle dynamic environments and recover from failures is an ongoing research topic.

In conclusion, SLAM feature tracking methods are crucial for enabling mobile robots to navigate and map their surroundings simultaneously. These methods involve extracting, matching, and tracking distinctive features in the environment to estimate the robot's motion and build a map. While feature descriptors and geometric constraints have traditionally been used, recent advancements in deep learning have opened new possibilities for improving tracking accuracy and robustness. However, challenges such as real-time processing, dynamic environments, and tracking initialization still need to be addressed. Continued research and development in SLAM feature tracking methods will contribute to the advancement of robotics and computer vision, enabling robots to operate autonomously in complex and dynamic environments.
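As a small illustration of the descriptor-based front end discussed above, the sketch below matches ORB features between two consecutive frames; it assumes the OpenCV (cv2) package is available and leaves motion estimation to a later stage, so it is only one piece of a tracking pipeline.

import cv2

def match_orb(prev_gray, curr_gray, max_features=1000):
    # Detect and describe ORB keypoints in two consecutive grayscale frames,
    # then return cross-checked Hamming-distance matches sorted by distance.
    orb = cv2.ORB_create(nfeatures=max_features)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return [], kp1, kp2
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    return matches, kp1, kp2

The resulting correspondences would typically be passed to an essential-matrix, homography or PnP solver inside a robust estimation loop to recover the inter-frame motion.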
Robust feature matching in 2.3µs

Simon Taylor, Edward Rosten, Tom Drummond
Department of Engineering, University of Cambridge
Trumpington Street, Cambridge, CB2 1PZ, UK
{sjt59,er258,twd20}@

In this paper we present a robust feature matching scheme in which features can be matched in 2.3µs. For a typical task involving 150 features per image, this results in a processing time of 500µs for feature extraction and matching. In order to achieve very fast matching we use simple features based on histograms of pixel intensities and an indexing scheme based on their joint distribution. The features are stored with a novel bit mask representation which requires only 44 bytes of memory per feature and allows computation of a dissimilarity score in 20ns. A training phase gives the patch-based features invariance to small viewpoint variations. Larger viewpoint variations are handled by training entirely independent sets of features from different viewpoints.

A complete system is presented where a database of around 13,000 features is used to robustly localise a single planar target in just over a millisecond, including all steps from feature detection to model fitting. The resulting system shows comparable robustness to SIFT [8] and Ferns [14] while using a tiny fraction of the processing time, and in the latter case a fraction of the memory as well.

Figure 1. Two frames from a sequence including partial occlusion and significant viewpoint variation. The average total processing time per 640x480 frame for the sequence is 1.37ms using one core of a 2.4GHz processor. Extracting runtime patch descriptors and finding matches in the database accounts for 520µs of this time.

1. Introduction

Matching the same real world points in different images is a fundamental problem in computer vision, and a vital component of applications such as automated panorama stitching (e.g. [2]), image retrieval (e.g. [16]) and object localisation (e.g. [8]).

Matching schemes must define a measure of similarity between parts of images, which in the ideal case is high if the image locations correspond to the same real-world point and low otherwise. The most basic description of a region of an image is a patch of pixel values. Patch matches can be found by searching for a pair of patches with a high cross-correlation score or a low sum-of-squared-differences (SSD) score. However patch matching with SSD provides no invariance to common image transformations such as
viewpoint change, and performing exhaustive patch matching between all possible pairs of patches is infeasible.

Moravec proposed an interest point detector [13] to introduce some invariance to translation and hence reduce the number of patch matches to be considered. Interest point detection is now well-established as the first stage of state-of-the-art matching schemes. There are many other transformations between the images, such as rotation and scale, which an ideal matching scheme should cope with. There are generally two approaches possible for each category of transformation: either factor out the effect of the transformation, or make the representation of the area of interest invariant to it. Detecting interest points falls into the first category in that it factors out coarse changes in position.

Schmid and Mohr [16] presented the first interest point approach to offer invariance to many image transformations. A number of rotationally invariant features were computed around interest points in images. During matching the same features were computed at multiple scales to give the method invariance to both scale and rotation changes around the interest point.

Instead of computing features invariant to rotation, a canonical orientation can be computed from the region around an interest point and used to factor out the effect of rotation. A variety of methods for finding orientation have been proposed, including the orientation of the largest eigenvector in Harris [4] corner detection, the maxima in an edge orientation histogram [8] or the gradient direction at a very coarse scale [2].

The interest point detection stage can also factor out more than just translation changes. Scale changes can be accounted for by searching for interest regions over scale space [8,10]. The space of affine orientation has too many dimensions to be searched directly, so schemes have been proposed to perform local searches for affine orientation starting from scale-space interest regions [11]. Alternatively, interest regions can be found and affine orientation deduced from the shape of the region [9].

Schemes such as those above can factor out large changes due to many common imaging transformations, but differences between matching patches will remain due to errors in the assignment of the canonical parameters and unmodelled distortions. To give robustness to these errors the patches extracted from the canonical frames undergo a further stage of processing. Lowe's SIFT (scale invariant feature transform) method [8] typifies this approach and uses soft binning of edge orientation histograms which vary weakly with the position of edges.

Other systems in this category include GLOH (Gradient Location and Orientation Histogram) [12] and MOPS (Multi-scale Oriented Patches) [2], which extracts patches from a different scale image to the interest region detection. Winder and Brown applied a learning approach to find optimal parameters for these types of descriptor [18]. The CS-LBP descriptor [5] uses a SIFT-style histogram of local information from the canonical patches, but the local information used is a binary pattern rather than the local gradient used in SIFT.

All of the above approaches aim to compute a single descriptor for a real-world feature which is as invariant as possible to all likely image transformations. Correspondences between images are determined by extracting descriptors from both images and finding those that are close neighbours in feature space.

An interesting alternative approach recasts the matching problem as one of classification. This
approach uses a training stage to train classifiers for the database features, which allows matching to be performed with less expensive computation at run-time than required by descriptor-based methods. Lepetit et al. demonstrated real-time matching using randomised trees to classify patches extracted from location, scale and orientation-normalised interest regions [7]. Only around 300 bits are computed from the query images for each interest region to be classified. Later work from Ozuysal et al. introduced the Ferns method [14], which improved classification performance to the point where the orientation normalisation of interest regions was no longer necessary. These methods only perform simple computations on the runtime image; however, the classifiers need to represent complicated joint distributions for each feature and so a large amount of memory is required. This limits the approach to a few hundred features on standard desktop PCs.

Runtime performance is of key importance for many applications. The template tracking system of Jurie and Dhome [6] performs well but, in common with any tracking scheme, relies on small frame-to-frame motion and requires another method for initialisation. Recent work on adapting the SIFT and Fern approaches to mobile phones [17] made trade-offs to both approaches to increase speed whilst maintaining usable matching accuracy. Our method is around 4 times faster than these optimised implementations and achieves more robust localisation.

Existing state-of-the-art matching approaches based on descriptor computation or patch classification attempt to match any possible view of a target to a small set of key features. Descriptor-based approaches such as SIFT factor out image transformations with computationally expensive image processing. Classification methods such as Ferns offer reduced runtime computation but have a high memory cost to represent the complex joint distributions involved.

Our method avoids the complexity inherent to matching areas of images subject to large transformations. Instead we employ a training phase to learn independent sets of features for different views of the target, and insert them all into the database for the target. The individual features are only invariant to small changes of viewpoint. This simplifies the matching problem so neither the computationally expensive normalisation over transformations of SIFT-style methods nor the complex classifier of the Fern-like approach is required.

As we only require features to be invariant to small viewpoint changes, we need far less invariance from our interest point detector than other matching schemes. The FAST-9 (Features from Accelerated Segment Test) detector [15] is a perfect fit for our application as it shows good repeatability over small viewpoint variations and is extremely efficient, since it requires no convolutions or searches over scale space.

A potential problem with using features with less invariance than those of other approaches is that more database features will be required to allow robust matching over equivalent ranges of views at runtime. Therefore to make our new approach feasible we require features that have a low memory footprint and which permit rapid computation of a matching score. Our novel bit-mask patch feature fulfils these criteria.

As runtime performance is our primary concern we would like to avoid too much processing on the pixels around the detected interest points. Using pixel patches would be one of the simplest possible matching schemes, but SSD-based patch matching would not even provide the
small amount of viewpoint invariance we desire. One of the reasons SSD is very sensitive to registration errors is that it assigns equal weight to errors from all the pixels in the patch. Berg and Malik [1] state that registration errors, at least for scale and rotation, will have more effect on samples further from the centre of the patch. The authors reduce the weight of errors in those samples by employing a variable blur which is stronger further from the centre of the patch. We use the idea that not all pixels in a patch are equally important for matching, but further note that the weights which should be assigned to pixels also depend on the individual feature: samples in the centre of large regions of constant intensity will be robust to small variations in viewpoint.

We employ a training phase to learn a model for the range of patches expected for each feature. This model allows runtime matching to use simple pixel patches whilst providing sufficient viewpoint invariance for our framework. For fast localisation the memory and computational cost of matching is reduced by heavily quantising the model to a small binary representation that can be very efficiently matched at runtime.

1.1. Our Contributions

• We show fast and robust localisation of a target using simple features which only match under small viewpoint variations.

• A large set of features from different views of a target are combined to allow matching under large transformations.

• We introduce a simple quantised-patch feature with a bit mask representation which enables very fast matching at runtime. The features represent the patch variations observed in a training phase.

2. Learning Features for a Target

We use a large set of training images covering the entire range of viewpoints where localisation is required. The set of images could be captured for real, but we instead artificially generate the set by warping a single reference image. Different scales, rotations and affine warps are included in the training set. Additionally, random pixel noise and a blur of a small random size are added to each generated view so the trained features have more robustness to poor quality images.

The training views for a target are grouped into several hundred viewpoint bins so that each bin covers a small range of viewpoints. The interest point detector is run on each image in the bin in sequence and patches are extracted from around the detected corners. The interest point locations can be converted to a position in the reference frame as the warp between the reference and training image is known.
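A rough sketch of this training-view generation step is given below, assuming OpenCV and NumPy are available; the warp, blur and noise ranges are placeholders rather than the values used by the authors, and only similarity warps are shown.

import cv2
import numpy as np

def generate_training_view(reference, rng, scale=(0.8, 1.0),
                           rotation_deg=(-5.0, 5.0), max_blur=3):
    # Warp the reference image with a random scale/rotation, then add a
    # small random blur and pixel noise; returns the view together with
    # the 3x3 warp so detected corner positions can be mapped back to the
    # reference frame. Parameter ranges here are illustrative only.
    h, w = reference.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0),
                                rng.uniform(*rotation_deg),
                                rng.uniform(*scale))
    warp = np.vstack([M, [0.0, 0.0, 1.0]])
    view = cv2.warpPerspective(reference, warp, (w, h))
    k = 2 * int(rng.integers(0, max_blur // 2 + 1)) + 1   # odd blur kernel
    view = cv2.GaussianBlur(view, (k, k), 0)
    noisy = view.astype(np.float64) + rng.normal(0.0, 4.0, view.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8), warp

# e.g. view, warp = generate_training_view(ref_img, np.random.default_rng(0))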
If the database for the viewpoint already contains a feature nearby the detected point in the new training image, then the patch model for that feature is updated with the new patch. Otherwise a new feature is created and added to the database. When all of the images in a viewpoint bin have been processed we select the n features (typically 50-100) which were most repeatably detected by the FAST detector and quantise their patch models to the binary feature descriptions used at runtime, as described in the following section.

2.1. Database Feature Representation

The features in our system are based on an 8×8 pixel patch extracted from a sparsely sampled grid around an interest point, as shown in Figure 2. The extracted samples are firstly normalised such that they have zero mean and unity standard deviation to give robustness to affine lighting variations. During training we build a model of the feature which consists of 64 independent empirical distributions of normalised intensity, one per pixel of the sampling grid.

Figure 2. Left: The sparse 8×8 sampling grid used by the features. Right: The 13 samples selected to form the index.

This model can be used to calculate the likelihood that a runtime patch is from a trained feature, assuming each pixel is independent. However computing this likelihood estimate would require too much memory and computation time to be used in real-time on a large database of features. Since features only need to match over small viewpoint ranges we are able to heavily quantise the model for a feature and still obtain excellent matching performance.

We quantise the per-pixel distribution in two ways. Firstly, the empirical intensity distributions are represented as histograms with 5 intensity bins. Secondly, when training is complete we replace the probability in each bin with a single bit which is 1 if pixels rarely fell into the bin (less than 5% of the time). The quantisation is illustrated in Figure 3.

Figure 3. The independent per-pixel empirical distributions are quantised into 5 intensity bins, and then further quantised into a bit mask identifying bins rarely observed during the training phase. This process is shown for: (left) a constant intensity region, (centre) a step change in intensity, (right) an intensity ramp. The data was created by taking the image (top) and adding random blur, noise and translation errors.

A feature in the database D can be written as:

D = \begin{pmatrix} D_{0,0} & D_{0,1} & D_{0,2} & D_{0,3} & D_{0,4} \\ D_{1,0} & D_{1,1} & D_{1,2} & D_{1,3} & D_{1,4} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ D_{63,0} & D_{63,1} & D_{63,2} & D_{63,3} & D_{63,4} \end{pmatrix},   (1)

where a row D_{i,\cdot} corresponds to the quantised histogram
However the bit mask does identify the intensity bins that samples rarely fell into at training time and so good matches should only have a small number of samples which fall into these bins in the runtime patch.Hence we use a count of the number of samples which fall into bins marked with a1in the database patch description as our dissimilarity score.The best matching feature in the database is the one that gives the lowest dissimilarity score when compared to the query patch,as that represents the match with fewest “errors”(runtime pixels in unexpected bins).The major advantage of the simple error count measure is that it can be computed with bitwise operations,which allows a large number of potential matches to be scored very quickly.The bitwise representation of a runtime patch R is slightly different to the database feature of equation1.It is also represented by a320-bit value but has exactly1bit set for each pixel,corresponding to the intensity bin which the sample from the runtime patch is in:R i,j={1if B j<RP(x i,y i)<B j+10otherwise.(3)where RP(x i,y i)is the value of pixel i in the normalised runtime patch extracted from around an interest point de-tected in a runtime image.With the preceeding definitions of the database and run-time patch representations the dissimilarity score can be simply computed by counting the number of bits where both D i,j and R i,j are equal to1:e=∑i,jD i,j⊗R i,j,(4)where⊗is a logical AND.Since each row of R always has one single bit set,this can be rewritten as:e=∑i((D i,0⊗R i,0)⊕...⊕(D i,4⊗R i,4))(5)where⊕denotes logical OR.By packing each column of D and R into a64bit integer(D j and R j)the necessary logical operations can be performed for all rows in parallel. The dissimilarity score can thus be obtained from a bitcount of a64-bit integer:e=bitcount((D0⊗R0)⊕...⊕(D4⊗R4))(6) Computing the error measure therefore requires5ANDs, 4ORs and a bit count of a64bit integer.Some architectures (including recent x86CPUs with SSE4.2)support a single-instruction bitcount.For other architectures,including our test machine,the bitcount can be performed in16instruc-tions using an11bit lookup table to count chunks of11bits at a time.The total time to compute an error measure using the lookup table bitcount is about20ns.Thefirst stage offinding matches from a runtime image is to run the FAST-9interest point detector.As the training phase has selected the most repeatable FAST features from each viewpoint it is not necessary to obtain too many inter-est points from the input image.We typicallyfind no more than200are needed for robust localisation.The8×8patch of Figure2is extracted,and the mean and standard devi-ation of the samples are calculated to enable quantisation into the320-bits R i,j of equation3.The dissimilarity score between the patch and each database feature is computed using the fast method of equation6.The database feature with the lowest dissimilarity score for a runtime patch is treated as a match if the error count is below a threshold(typically5).The matches from all the runtime patches can be sorted by error count to order them in terms of quality.3.1.IndexingThe dissimilarity score between a runtime patch and a database feature can be computed very quickly using equa-tion6,however as we use larger numbers of features than alternative approaches it is desirable to combine the basic method above with an indexing scheme to reduce the num-ber of scores which must be computed and to prevent the search time growing linearly with the database size.The 
indexing approach we use is inspired by the Ferns work[14]which uses joint distributions of simple binary tests from training images.Our current implementation uses the13samples shown on the right of Figure2to com-pute an index number.The samples have been selected rea-sonably close to the patch centre as they are expected to be more consistent under rotation and scale,but somewhat spaced apart so that they are reasonably uncorrelated.Each of the samples selected for the index is quantised to a single bit:1if the pixel value is above the mean of the patch and0otherwise.The13samples are then concate-nated to form a13-bit integer.Thus the index in our cur-rent implementation can take values between0and8192. The index value is used to index a lookup table of sets of database features.At runtime the dissimilarity score is only computed against the set of features in the entry of the table with the matching index.The training phase is used to determine the set of index values which will account for most possible runtime views of a particular feature.Every patch from the training set that contributes to the model for a particular feature also con-tributes a vote for the index value computed from the patch. After training is complete we select the most-common in-dices until together the selected set of indices account for at least80%of the training patches used in building the fea-ture.This set of indices is saved with the feature,and the feature is inserted into all of the corresponding sets of fea-tures in the lookup table at runtime.3.2.Improving Robustness to BlurFAST is not an inherently multi-scale detector and fails to detect good features when the image is significantly blurred.Although our training set includes some random blur so the features are trained to be robust to this we still rely on the repeatability of the detector tofind the features in thefirst place.The few frames where blur is a problem in typical image sequences do not justify switching to a multi-scale detector,so we take a different approach.To perform detection in blurred images,we create an im-age pyramid with a factor of2in scale between images,and run FAST on each layer of the pyramid.In order to avoid incurring the cost of building the pyramid at each frame,we use a data driven approach to decide when to stop building the pyramid.Initially features are extracted and matched on the full-sized image.The features are then fed to the next stage of processing,such as estimating the camera pose.If the later stages of processing determine that there are too few good matches,then another set of features are extracted from the next layer of the image pyramid.These are aggregated with thefirst set of features,but the new features are assumed to have a better score.If again insufficient matches are found, the next layer of the pyramid is used and so on until either enough good matches or a minimum image size has been reached.We choose a factor of2between images in the pyra-mid,as this allows for a particularly efficient implementa-tion such that around200µs are required to half-sample a 640×480frame.We build a pyramid with a maximum of 3layers.The resulting system obtains considerable robust-ness to blur,since the blur in the smallest layer is reduced by a factor of4.Furthermore,it allows for matches to be made over a greater range of scales as the automatic fallback to sub-sampled images allows matching on frames when the camera is closer to the target than any training images. 
4.Results and DiscussionIn order to validate our method,we apply it to the task of matching points in frames of a video sequence to a known planar object,andfinding the corresponding homography. Afterfinding matches the homography is estimated using PROSAC[3]and refined using the inliers.The inlier set is reestimated and refined for several iterations.The result-ing homography allows us to determine which points were matched correctly.The database for the frames shown in Figure1was gen-erated from a training set of21672images,generated by warping a single source image of the target.7different scale ranges and36different camera axis rotation ranges were used,giving a total of252viewpoint bins.Each bin covers a reduction in scale by a factor of0.8,10degrees of camera axis rotation,and out-of-plane viewpoints in all di-rections of up to30degrees.We extract around50features from each viewpoint bin(more from larger scale images), giving a total of13372features in the database.4.1.Validating the Bit Count Dissimilarity ScoreTwo short video sequences of the planar target of Figure 1were captured using a cheap VGA webcam.Thefirst se-quence was captured from viewpoints which were known to have been covered by our training phase whereas the second sequence was viewed with a larger out-of-plane rotation, known to be outside the range of training.The database fea-tures were trained from the source image,whereas the test sequences were poor-quality webcam images of a printed version of thefile.Thus both sequences test the method’sFAST interest point detection 0.55ms Building query bit masks 0.12ms Matching into database 0.35ms Robust pose estimation 0.1ms Total frame time 1.12msTable 1.Timings for the stages of our approachon a dataset with images taken from within the range of trainedviewpoints.Figure 4.The bit error count provides a reasonable way to deter-mine good matches.Left:matches from viewpoints contained in training set.Right:matches on viewpoints from outside training set.robustness to different imaging devices.Matching on the first test sequence was very good,cor-rectly localising the target in all 754frames of the test se-quence.There was little blur in the sequence so the full frame provided enough matches in all but 7frames of the sequence,when the half-sampled image fallback was used to obtain enough matches for a confident pose estimate.The average total frame time on the sequence was 1.12ms on a 2.4GHz processor.The time attributed to each stage of the process is shown in Table 1.Somewhat surprisingly our method also performed rea-sonably well on the second sequence,even though it was known the frames were taken from views that were not cov-ered by our training set.On this sequence the target was lo-calised in 635frames of the 675in the sequence (94%).As expected the pose estimate using onlythe full-frame image was generally less confident so the fallbacks to sub-sampled images were used more often:377frames used the half-image and 63also used the quarter-scale image.Because of this additional workload the per-frame average time in-creased to 1.52ms.The matching performance on these test sequences sug-Figure 5.Increasing the range of viewpoint bins in the training set allows more viewpoint invariance to be added in a straightforward manner.gests that the bit count dissimilarity score provides a reason-able way of scoring matches.To confirm this we computed the average number of inlier and outlier matches over all of the frames in the two sequences,and plotted these against the 
4. Results and Discussion

In order to validate our method, we apply it to the task of matching points in frames of a video sequence to a known planar object, and finding the corresponding homography. After finding matches the homography is estimated using PROSAC [3] and refined using the inliers. The inlier set is reestimated and refined for several iterations. The resulting homography allows us to determine which points were matched correctly.

The database for the frames shown in Figure 1 was generated from a training set of 21672 images, generated by warping a single source image of the target. 7 different scale ranges and 36 different camera axis rotation ranges were used, giving a total of 252 viewpoint bins. Each bin covers a reduction in scale by a factor of 0.8, 10 degrees of camera axis rotation, and out-of-plane viewpoints in all directions of up to 30 degrees. We extract around 50 features from each viewpoint bin (more from larger scale images), giving a total of 13372 features in the database.

4.1. Validating the Bit Count Dissimilarity Score

Two short video sequences of the planar target of Figure 1 were captured using a cheap VGA webcam. The first sequence was captured from viewpoints which were known to have been covered by our training phase, whereas the second sequence was viewed with a larger out-of-plane rotation, known to be outside the range of training. The database features were trained from the source image, whereas the test sequences were poor-quality webcam images of a printed version of the file. Thus both sequences test the method's robustness to different imaging devices.

Matching on the first test sequence was very good, correctly localising the target in all 754 frames of the test sequence. There was little blur in the sequence so the full frame provided enough matches in all but 7 frames of the sequence, when the half-sampled image fallback was used to obtain enough matches for a confident pose estimate. The average total frame time on the sequence was 1.12ms on a 2.4GHz processor. The time attributed to each stage of the process is shown in Table 1.

FAST interest point detection   0.55ms
Building query bit masks        0.12ms
Matching into database          0.35ms
Robust pose estimation          0.1ms
Total frame time                1.12ms
Table 1. Timings for the stages of our approach on a dataset with images taken from within the range of trained viewpoints.

Somewhat surprisingly our method also performed reasonably well on the second sequence, even though it was known the frames were taken from views that were not covered by our training set. On this sequence the target was localised in 635 frames of the 675 in the sequence (94%). As expected the pose estimate using only the full-frame image was generally less confident, so the fallbacks to sub-sampled images were used more often: 377 frames used the half-image and 63 also used the quarter-scale image. Because of this additional workload the per-frame average time increased to 1.52ms.

The matching performance on these test sequences suggests that the bit count dissimilarity score provides a reasonable way of scoring matches. To confirm this we computed the average number of inlier and outlier matches over all of the frames in the two sequences, and plotted these against the dissimilarity score obtained for the match in Figure 4. For the sequence on the left, where the viewpoints are included in the training set, many good matches are found in each frame, with on average 9.7 zero-error inliers obtained. The inlier percentage for matches with low dissimilarity scores is also good, at over 82% in the zero error case. The result that both the number of inliers and the inlier fraction drop off with increasing dissimilarity score demonstrates that the simple bit error count is a reasonable measure of the quality of a match.

Figure 4. The bit error count provides a reasonable way to determine good matches. Left: matches from viewpoints contained in the training set. Right: matches on viewpoints from outside the training set.

The figure provides strong support for a PROSAC-like robust estimation procedure once the matches have been sorted by dissimilarity score, as the low error matches are very likely to be correct. Even when the viewpoint of the query image is outside the range for which features have been trained, as in the data on the right of Figure 4, the dissimilarity score still provides a reasonable way to sort the matches, as the inlier fraction can be seen to drop off with increasing dissimilarity. The inlier rate of the first matches when sorted by dissimilarity score is still sufficient in most frames to obtain a pose with a robust estimation stage such as PROSAC.

4.2. Controllable Viewpoint Invariance

As our framework uses independent features for different viewpoint bins, it is possible to trade off between robustness to viewpoint variations and computation required for localisation by simply adding or removing more bins.

Figure 5. Increasing the range of viewpoint bins in the training set allows more viewpoint invariance to be added in a straightforward manner.

For applications where viewpoints are restricted (for example if the camera has a roughly constant orientation) the number of database features can be drastically reduced, leading to even higher performance. Alternatively, if more computational power is available it is possible to increase the