Video Annotation Based on Kernel Linear Neighborhood Propagation
A Video Summarization Generation Algorithm Based on k-means++ Clustering

3) Compute, by Eq. (6), the probability P(x_j) that each vector is selected as a cluster center; the vector with the largest P(x_j) becomes the new cluster center.

P(x_j) = D(x_j)² / Σ_i D(x_i)²    (6)
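The selection rule above can be sketched as follows. This is a minimal illustration of the D²-weighted seeding of Eq. (6); the function name is illustrative, and new centers are taken as the argmax of P(x_j), the deterministic variant described in the text, rather than sampled at random as in classical k-means++:

```python
import numpy as np

def kmeanspp_centers(X, k):
    """Pick k initial cluster centers by the D^2 rule of Eq. (6).

    D(x) is the distance from x to its nearest already-chosen center;
    the vector maximizing P(x_j) = D(x_j)^2 / sum_i D(x_i)^2 becomes
    the next center.
    """
    centers = [X[0]]                       # seed with the first sample
    for _ in range(1, k):
        # squared distance of every sample to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()              # Eq. (6)
        centers.append(X[int(np.argmax(probs))])
    return np.array(centers)
```

The farthest remaining sample always maximizes P(x_j), so centers spread out across the data before the usual k-means iterations begin.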
First, the video is decomposed into an image sequence and pre-sampled; then, HSV color-space features are extracted from all pre-sampled frames; finally, the improved k-means++ algorithm clusters all pre-sampled frames, and the frame closest to each cluster center is selected as a key frame.
Keywords: video summarization, k-means++ clustering, color space HSV
Video summarization is currently a research hotspot in computer vision. This paper proposes a video summarization generation algorithm based on k-means++ clustering, which not only …
Keywords: video summarization; k-means++ clustering; HSV color space
Abstract: In order to further improve the quality of the generated video summarization, an algorithm based on k-means++ clustering is proposed in this paper. First, decompose the original video into image sequences and go through the pre-sampling process. Then, extract the color features of sample frames based on the HSV color space. Finally, cluster sample frames through an improved k-means++ algorithm, and select the frame which is closest to the clustering center as the key frame.
An Optimized Calibration Method for Fisheye Cameras Based on the Kannala Model

No. 6
ZHANG Chunsen et al.: An optimized calibration method for fisheye cameras based on the Kannala model
1.2.1 Radial distortion
Radial distortion is the distortion of image pixels along the radial direction of the lens about the distortion center; that is, the deformation grows larger the farther a point lies from the lens center. Radial distortion mainly comprises pincushion distortion and barrel distortion, produced far from and close to the optical axis, respectively. Pincushion distortion forms when the image is drawn in toward the optical-axis center, as shown in Fig. 1(a). Barrel distortion forms when the image center is pushed …
Fig. 1  Radial distortion diagram
1.2.2 Tangential distortion
Tangential distortion is caused by assembly errors; because of it, a rectangle projected onto the imaging plane may become a trapezoid. Tangential distortion is described by the parameters p1 and p2, and the corrected coordinates are given by Eq. (6).
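Eq. (6) is not reproduced in this excerpt; a sketch of the standard Brown–Conrady tangential terms (the form used, e.g., by OpenCV's distortion model) is given below. The function name is illustrative, and the coordinates are assumed to be normalized relative to the distortion center:

```python
def tangential_correct(x, y, p1, p2):
    """Apply the standard tangential (Brown-Conrady) distortion terms.

    (x, y) are normalized image coordinates relative to the distortion
    center; p1 and p2 are the tangential distortion coefficients.
    """
    r2 = x * x + y * y                                  # squared radius
    x_corr = x + (2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x))
    y_corr = y + (p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y)
    return x_corr, y_corr

# With zero coefficients the point is unchanged.
print(tangential_correct(0.1, 0.2, 0.0, 0.0))
```

When p1 = p2 = 0 the mapping is the identity, which is a quick sanity check that the correction leaves an ideal lens untouched.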
Abstract: In view of the large field of view and ultra-short focal length of the fisheye camera, the traditional camera calibration algorithm cannot achieve the calibration based on the pinhole imaging model. This paper proposes a fisheye camera calibration optimization based on the traditional Kannala model. Firstly, a study has been made of the camera imaging model and distortion type of the fisheye camera, and on the basis of the traditional Kannala model, the piecewise polynomial approximation model is established to realize the original model optimization. Then, the internal parameters and distortion coefficients of the camera are obtained according to the traditional Kannala model and the optimization model, by which the distortion-correction images are obtained. Finally, the advantages of this algorithm are quantitatively and qualitatively analyzed by using the back-projection error and the multi-view stereo vision 3D reconstruction of the distortion-corrected image. The results indicate that the camera parameters and distortion coefficients are obtained by calibration to correct the original image and to carry out the multi-view stereo vision 3D reconstruction, and the reverse projection error analysis and 3D reconstruction visualization of the camera check are proved to be effective in the calibration of the opti…
Research on Deep-Learning-Based Video Object Detection and Tracking Algorithms

The rapid development of deep learning has brought revolutionary changes to the field of computer vision.
Over the past few years, deep learning has achieved remarkable results in image classification, object detection, semantic segmentation and related areas.
However, because video data are continuous in both time and space, accurate object detection and tracking in video remains a challenging problem.
This article reviews the research progress of video object detection and tracking algorithms from the perspective of deep learning.
I. Video object detection algorithms
Video object detection algorithms aim to accurately locate and detect the key target objects in a video sequence. Current mainstream approaches follow two main lines: single-frame detection with temporal information fusion, and multi-object tracking.
1. Single-frame detection with temporal information fusion
Single-frame detection algorithms are an extension of image object detection: each frame is detected independently, and temporal information fusion is then used to improve detection accuracy. Such algorithms usually employ convolutional neural networks (CNNs) for detection, e.g. R-CNN, Faster R-CNN and YOLO. However, because video data are temporally continuous, these methods tend to ignore the temporal consistency of targets, leading to inaccurate detections. To address this, researchers have proposed a range of temporal information fusion methods, such as inter-frame interpolation, optical-flow estimation and long short-term memory (LSTM) networks. These methods model video data along the time dimension and thereby improve detection accuracy. In addition, some optical-flow-based methods exploit target motion information to boost detection performance. These methods achieve good results on many benchmark datasets, but their computational complexity is high and they place heavy demands on hardware.
2. Multi-object tracking
Multi-object tracking algorithms aim to continuously track multiple targets in a video sequence while keeping each target's identity unchanged. Current mainstream approaches follow two lines: detection-then-tracking methods, and methods based on online learning and online inference. Detection-then-tracking methods treat detection and tracking as two independent tasks: a detection algorithm first finds the targets in the video sequence, and a tracking algorithm then follows them. The advantage of this approach is that the accuracy of the detector can be exploited, but because the two tasks are mutually independent, detection errors and tracking failures propagate easily.
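The detect-then-track idea described above can be sketched with a greedy IoU association step between consecutive frames. All function names and the 0.3 threshold below are illustrative assumptions, not taken from the article:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def match_tracks(tracks, detections, thresh=0.3):
    """Greedily associate existing tracks with new detections by IoU.

    tracks maps track id -> last box; detections is a list of boxes.
    Returns {track_id: detection_index} for matches above the threshold.
    """
    assignments, used = {}, set()
    for tid, box in tracks.items():
        best, best_iou = None, thresh
        for j, det in enumerate(detections):
            score = iou(box, det)
            if j not in used and score > best_iou:
                best, best_iou = j, score
        if best is not None:
            assignments[tid] = best
            used.add(best)
    return assignments
```

Unmatched detections would spawn new tracks and unmatched tracks would eventually be dropped; this is exactly where the independence of the two tasks lets a missed detection turn into a tracking failure.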
A Surveillance Camera Tampering Detection Model

Introduction
Surveillance video is an important source of information and evidence in criminal investigations, but offenders are quite likely to cover up suspicious activity by interfering with or even destroying the camera; effective detection of camera tampering events therefore has important practical value.
In current camera tampering detection methods, false detection in special scenes remains a major challenge, e.g. illumination changes, weather changes, crowd flow, and large objects passing through.
For surveillance video that has little or no background texture, or that is dark or of low quality, most detection methods misidentify the scene as defocus tampering; when the lens is occluded by a textured object whose grayscale and brightness are very similar to the image background, the occluder cannot be distinguished from the background.
Beyond this, slow occlusion of the lens is also a challenging detection problem.
This paper builds a detection model with deep neural networks: an improved ConvGRU (Convolutional Gated Recurrent Unit) extracts the temporal features of the video and the global spatial dependencies of the image, and, combined with a Siamese architecture, yields the proposed SCG model.
A Surveillance Camera Tampering Detection Model
LIU Xiao-nan, SHAO Pei-nan (The 32nd Research Institute of China Electronics Technology Group Corporation, Shanghai 201808, China)
Abstract: This paper proposes an SCG (Siamese with Convolutional Gated Recurrent Unit) model based on the Siamese architecture to reduce the false detections caused by special scenes in camera tampering detection. The model uses the latent similarity between video clips to distinguish special scenes from tampering events. The improved ConvGRU network is integrated into the Siamese architecture so that the model fully exploits the inter-frame temporal correlation of surveillance video, and the non-local blocks embedded between GRU cells allow the network to establish spatial dependency responses over the image. Compared with a tampering detection model using conventional GRU modules, the model with the improved ConvGRU module improves accuracy by 4.22%. In addition, a residual attention module is introduced to strengthen the feature extraction network's perception of changes in the image foreground; compared with the model without the attention module, accuracy improves by a further 2.49%.
Keywords: Siamese; ConvGRU; non-local block; camera tampering; tampering detection
CLC number: TP391; Document code: A; Article ID: 1009-2552(2021)01-0090-07; DOI: 10.13274/ki.hdzj.2021.01.016
About the author: LIU Xiao-nan (1994–), female, M.S. candidate; research interests: computer vision and deep learning.
Video Content Authentication Using HEVC

Authors: ZHANG Minghui; FENG Gui (College of Information Science and Engineering, Huaqiao University, Xiamen 361021, China)
Journal: Journal of Huaqiao University (Natural Science), 2017, 38(5), pp. 721–726
Abstract: A video content authentication algorithm based on High Efficiency Video Coding (HEVC) is proposed. Feature codes generated from image texture are used to modify the partition modes, inter-prediction modes and motion vectors of inter-frame 8 × 8 coding units, while the optimal coding-unit partition mode and the corresponding prediction mode and motion vector are retained. Experimental results show that the algorithm has very little effect on video quality, that the bitrate changes little after watermark embedding, and that the algorithm exhibits good fragility, so it can be used for video authentication.
Abstract (English, as published): In this paper, a video content authentication scheme based on high efficiency video coding (HEVC) has been proposed. The scheme used the feature codes generated from image texture to modify partition modes, inter-prediction modes and the value of motion vector of inter-frame 8 × 8 coding units, and reserve the optimal coding unit splitting mode with corresponding prediction mode and motion vector. The experimental results show that our proposed algorithm has very small effect on video quality and bitrate, and our scheme can be used for authenticating video content owing to its good fragility.
Language: Chinese; CLC number: TP391
Kernel Neighborhood Preserving Discriminant Embedding for Face Recognition

The experimental results on the ORL and Yale face databases show that this algorithm achieves good recognition on nonlinear, small-sample face data.
Because face image data are high-dimensional and nonlinear, how …
WANG Yan, BAI Wan-rong
(College of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China)
Research on Deep-Learning-Based Video Action Recognition Algorithms

Deep learning has achieved remarkable breakthroughs in computer vision and has particularly broad application prospects in video action recognition.
This article discusses research on deep-learning-based video action recognition algorithms and analyzes their applications in depth.
I. Introduction
With the rapid development of computer vision technology, video action recognition has become a research area of wide interest.
Traditional action recognition methods are limited by hand-crafted feature extraction and pattern matching, and struggle to achieve accurate, efficient and robust recognition.
Deep-learning-based video action recognition algorithms, by learning features and patterns automatically, address these problems better.
II. Deep-learning-based video action recognition algorithms
1. Convolutional neural networks (CNN)
A convolutional neural network is a deep learning model widely used in image processing tasks. Through convolution and pooling layers, a CNN automatically extracts discriminative features from images. For video action recognition, a CNN can process each frame and exploit temporal information for action classification.
2. Recurrent neural networks (RNN)
A recurrent neural network is a deep learning model that can process sequential data. In video action recognition, an RNN can use its memory capacity to model and classify action sequences. By introducing long short-term memory (LSTM) units, an RNN effectively mitigates the vanishing and exploding gradients that arise when modeling long sequences.
3. Spatio-temporal convolutional networks (3D CNN)
A spatio-temporal convolutional network is a deep learning model designed specifically for video data. By extending convolution to the temporal dimension, a 3D CNN extracts spatial and temporal features jointly. Compared with conventional 2D CNNs, 3D CNNs perform better in video action recognition.
III. Research progress in deep-learning-based video action recognition
1. Feature representation learning
Feature representation learning is a key problem in deep-learning-based action recognition. Traditional methods usually rely on hand-designed feature representations, whereas deep-learning-based methods learn representations automatically, avoiding the limitations of manual feature design.
2. Temporal modeling
Temporal modeling is an important task in video action recognition. By introducing recurrent networks and related models, action sequences can be modeled so that temporal information is better captured. In addition, attention mechanisms can be introduced to strengthen the modeling of key frames or key time segments.
A Fast Image Retrieval Algorithm Based on Video Adaptive Sampling

Software Guide (软件导刊), Vol. 22 No. 7, Jul. 2023

A Fast Image Retrieval Algorithm Based on Video Adaptive Sampling
TAN Wenbin1, HUANG Yiwang1,2, LIU Sheng1 (1. College of Data Science, Tongren University, Tongren 554300, China; 2. Guizhou Provincial Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China)
Abstract: To reduce the heavy computation and long runtime of target-image retrieval in smart agricultural monitoring systems, a video adaptive sampling algorithm is proposed.
First, the sampling rate of video frames is adjusted adaptively according to changes in the similarity of adjacent frames so as to extract video key frames, ensuring that the extracted key frames can stand in for their adjacent frames in the target-image retrieval computation.
Then, the key frames are assembled along the time axis into a video-frame retrieval operator, which replaces the original video in the target-image retrieval computation, eliminating a large amount of repeated computation when searching a video for a target image and thereby improving retrieval efficiency.
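The adaptive-sampling idea can be sketched as below, using grayscale-histogram intersection as the similarity measure. The similarity metric, the threshold, and the stride limits are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def hist_similarity(frame_a, frame_b, bins=32):
    """Similarity of two grayscale frames via normalized histogram intersection."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 255), density=True)
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 255), density=True)
    return np.minimum(ha, hb).sum() / np.maximum(ha, hb).sum()

def adaptive_keyframes(frames, low=0.8, step_min=1, step_max=8):
    """Sample densely while content changes fast, sparsely while it is static."""
    keys, step = [0], step_min
    i = step_min
    while i < len(frames):
        if hist_similarity(frames[keys[-1]], frames[i]) < low:
            keys.append(i)                   # big change: keep as key frame
            step = step_min                  # and sample densely again
        else:
            step = min(step * 2, step_max)   # stable: widen the stride
        i += step
    return keys
```

The stride grows while adjacent frames stay similar and collapses back to the minimum as soon as a large change is seen, which is the behavior the paper attributes to its adaptive sampling rate.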
Experiments show that, compared with fixed-rate sampling and local-minimum keyframe algorithms, the video-frame retrieval operator built by the adaptive sampling algorithm has a higher and more stable detection rate.
On the premise that all target images are detected, replacing the original video with the video-frame retrieval operator in the retrieval computation yields a large optimization margin, cutting time cost by more than 60%, which is significant for improving the efficiency of target-image retrieval in smart agricultural monitoring systems.
Keywords: adaptive sampling; image similarity; target image frame; video frame retrieval operator
DOI: 10.11907/rjdk.231260; OSID; CLC number: TP391.41; Document code: A; Article ID: 1672-7800(2023)007-0131-07

A Fast Image Retrieval Algorithm Based on Video Adaptive Sampling
TAN Wenbin1, HUANG Yiwang1,2, LIU Sheng1 (1. College of Data Science, Tongren University, Tongren 554300, China; 2. Guizhou Provincial Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China)
Abstract: To solve the problem of high computational complexity and time-consuming target image retrieval in smart agricultural monitoring systems, a video adaptive sampling algorithm is proposed. Firstly, adaptively adjust the sampling rate of video frames based on changes in similarity between adjacent frames to extract video keyframes, ensuring that the keyframes extracted by the algorithm can replace adjacent frames in target image retrieval calculations. Then, a video frame retrieval operator is constructed based on the time axis of the video keyframes, replacing the original video to participate in the target image retrieval calculation, thereby reducing a large number of repeated calculations when retrieving the target image in the video, and achieving the goal of improving retrieval efficiency. Experiments have shown that the adaptive sampling algorithm has a higher and more stable detection rate than the video frame retrieval operators constructed by fixed-frequency sampling and minimum keyframe algorithms. On the basis of ensuring that all images are detected, using video frame retrieval operators to replace the original video in the calculation of target image retrieval has a significant optimization range, reducing time consumption by more than 60%, and is of great significance for improving the retrieval efficiency of target images in smart agricultural monitoring systems.
Key Words: adaptive sampling; image similarity; target image frame; video frame retrieval operators

0 Introduction
In recent years, with the rise of smart agriculture, plantations have gradually realized unmanned, automated and intelligent management.
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 4, JUNE 2008

Video Annotation Based on Kernel Linear Neighborhood Propagation
Jinhui Tang, Student Member, IEEE, Xian-Sheng Hua, Member, IEEE, Guo-Jun Qi, Yan Song, and Xiuqing Wu

Abstract—The insufficiency of labeled training data for representing the distribution of the entire dataset is a major obstacle in automatic semantic annotation of large-scale video databases. Semi-supervised learning algorithms, which attempt to learn from both labeled and unlabeled data, are promising to solve this problem. In this paper, a novel graph-based semi-supervised learning method named kernel linear neighborhood propagation (KLNP) is proposed and applied to video annotation. This approach combines the consistency assumption, which is the basic assumption in semi-supervised learning, and the local linear embedding (LLE) method in a nonlinear kernel-mapped space. KLNP improves a recently proposed method, linear neighborhood propagation (LNP), by tackling the limitation of its local linear assumption on the distribution of semantics. Experiments conducted on the TRECVID data set demonstrate that this approach outperforms other popular graph-based semi-supervised learning methods for video semantic annotation.

Index Terms—Kernel method, label propagation, semi-supervised learning, video annotation.

I. INTRODUCTION

Automatic annotation (also called high-level feature extraction in the TRECVID benchmark [1]) of video and video segments is an elementary step for semantic-level video search. As manually annotating a large video archive is labor-intensive and time-consuming, many automatic approaches have been proposed to handle this issue. Generally, these methods build statistical models from manually pre-labeled samples, and then assign labels to the unlabeled ones using these models.
However, this process has a major obstacle: frequently the labeled data are insufficient, so that their distribution does not well represent the distribution of the entire dataset (labeled and unlabeled), which usually leads to inaccurate annotation results. Over recent years, the availability of large data collections with only limited human annotation has attracted the attention of a growing community of researchers to the problem of semi-supervised learning [2]. By leveraging unlabeled data under certain assumptions, semi-supervised learning methods are promising for dealing with the above obstacle. Many works on this topic are reported in the machine learning literature, and some of them have been applied to video or image annotation. In [3], co-training is applied to video annotation based on a careful split of visual features. Yan et al. [4] analyze the drawbacks of co-training in video annotation, and propose an improved co-training-style algorithm named semi-supervised cross-feature learning.

(Manuscript received September 15, 2007; revised December 30, 2007; accepted February 19, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ling Guan. J. Tang, Y. Song, and X. Wu are with the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China (e-mail: jhtang@; songy@ustc.; wuxq@). X.-S. Hua is with Microsoft Research Asia, Beijing 100080, China (e-mail: xshua@). G.-J. Qi is with the Department of Automation, University of Science and Technology of China, Hefei 230027, China (e-mail: qgj@). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier 10.1109/TMM.2008.921853)
A structure-sensitive anisotropic manifold ranking method is proposed in [5] for video concept detection, where the authors analyze graph-based semi-supervised learning methods from the view of PDE-based diffusion. In [6], manifold ranking (a graph-based semi-supervised learning method) based on feature selection is proposed for video annotation. Wang et al. [7] propose a method based on random walk with restarts to refine the results of image annotation. Tang et al. [8] embed the temporal consistency of video data into graph-based semi-supervised learning and propose a temporally consistent Gaussian random field method for video annotation.

The key point of semi-supervised learning is the consistency assumption [9]: nearby samples in the feature space, or samples on the same structure (also referred to as a cluster or a manifold), are likely to have the same label. This assumption considers both the local smoothness and the structure smoothness of the semantic distribution. There are close relations between the consistency assumption and nonlinear dimensionality reduction schemes [10]–[12], since intrinsically they follow the same idea of reducing the global coordinate system of the dataset to a lower-dimensional one while preserving the local distribution structure. Recently, a method called linear neighborhood propagation (LNP) [13], [14] was proposed to combine these two strategies. LNP borrows the basic assumption of local linear embedding (LLE) [10], [12] that each sample can be reconstructed linearly from its neighboring samples, and further assumes that the label of the sample can be reconstructed from the labels of its neighboring samples using the same coefficients. This method implicitly assumes that the mapping from feature to label is linear in a local area (detailed in Section II). Usually the local linear assumption is reasonable, as in LLE. However, using this assumption to construct the propagation coefficients in graph-based semi-supervised learning has limited depictive ability for real-world datasets. Moreover, to ensure the convergence of the iterative label propagation, the reconstruction coefficients are required to be non-negative, which makes the linear label reconstruction imperfect. If the semantic super-plane has a high curvature in an area, LNP will fail; in other words, if the labels of the samples in a local area distribute complexly in the feature space, the linear assumption is not appropriate. Therefore this method is not well suited to video semantic annotation for certain concepts, since the semantic distributions of video segments, which are collected from different sources and span a large time interval, are typically very complex in the feature space.

Motivated by the great success of the kernel trick [15], we propose a novel method for automatic video annotation named kernel linear neighborhood propagation (KLNP), which also combines the consistency assumption and LLE but operates in a nonlinear kernel-mapped space. Fig. 1 shows the roadmap of KLNP. This method can handle more complex situations since it holds the advantages of both LNP and kernel methods, by mapping the input feature space to a nonlinear kernel-mapped feature space. Compared to the previous work in [16], we reconstruct the label reconstruction coefficients in a more effective way: one more constraint is added to the optimization problem for coefficient reconstruction, namely that every coefficient must be non-negative. This constraint is necessary to ensure the convergence of the iterative label propagation process. The coefficients in [16] cannot ensure this convergence, so that method is hard to implement, as it needs to compute the inverses of large matrices for label prediction. The experiments conducted on the TRECVID [1] dataset demonstrate that KLNP is more appropriate than LNP for complex applications and obtains more accurate results than other semi-supervised learning methods for video high-level feature extraction.

Fig. 1. Roadmap of KLNP.

The rest of this paper is organized as follows. In Section II, we briefly introduce LNP and analyze its limitation; the proposed KLNP for the video semantic annotation problem is detailed in Section III. We analyze the incremental extension of KLNP and show that KLNP is equivalent to KPCA plus LNP in Section IV. Experiments are presented in Section V, followed by concluding remarks in Section VI.

II. LNP AND ITS LIMITATION

In this section, we briefly introduce LNP and analyze its limitation. LNP is based on the assumption that the label of each sample can be reconstructed linearly from its neighbors' labels, with the same coefficients as those used to reconstruct the feature vector of the sample. This can be formulated as

f_i = Σ_{j: x_j ∈ N(x_i)} w_ij f_j,    (1)

where N(x_i) is the set of neighbors of sample x_i, w_ij are the reconstruction coefficients, and f_i is the label of x_i. Define the mapping from feature to label as f = g(x); then we can obtain

g(x_i) = Σ_{j: x_j ∈ N(x_i)} w_ij g(x_j).    (2)

Since each sample is linearly reconstructed from its neighbors,

x_i = Σ_{j: x_j ∈ N(x_i)} w_ij x_j.    (3)

Combining (2) and (3), we have

g(Σ_j w_ij x_j) = Σ_j w_ij g(x_j).    (4)

Equation (4) indicates that the mapping from feature to label is linear in a local area. Often the local linear assumption is reasonable for feature vector reconstruction, as in LLE [10]. However, using this assumption to construct the propagation coefficients in graph-based semi-supervised learning has limited depictive ability for real-world datasets. In fact, the linear assumption is a reduced case of the nonlinear one, so extending to nonlinear propagation strengthens the depictive ability of the model.

One of the contributions of this paper is to reveal that this strengthened nonlinear depiction is an important property for describing complex real-world problems, such as real-world video/image annotation on a news dataset. Although the linear assumption has had success on some benchmark datasets, such as UCI, performance on real-world sets has been reported only to a limited extent. From the experiments, we can see that on real-world datasets the nonlinear assumption outperforms the linear one significantly. Besides, to ensure the convergence of the iterative label propagation process in LNP [13], the reconstruction coefficients are required to be non-negative. This makes the linear label reconstruction imperfect: for instance, a linear combination with non-negative coefficients cannot reconstruct the topmost point well from its near neighbors in a two-dimensional space, whereas a nonlinear mapping can reconstruct it better. Especially when the semantics distribute complexly, using the local linear assumption with non-negative coefficients to reconstruct the label is not very effective. For example, the average label reconstruction error¹ of the concept Building on the training data of TRECVID05 [1] is 0.236, which is somewhat high. And from the viewpoint of LLE, LNP assumes that the semantic is a 1-D manifold embedded in the feature space; obviously a 1-D manifold cannot model a complex semantic distribution well. So this method cannot tackle the video semantic annotation problem well, since some semantics of video segments often have very complex distributions. We call this drawback the limitation of the local linear assumption on the distribution of semantics.

III. KERNEL LINEAR NEIGHBORHOOD PROPAGATION

To tackle the aforementioned limitation of LNP, in KLNP we map the features to a higher-dimensional space through a kernel mapping.

¹Average label reconstruction error: (1/n) Σ_i ((f_i − f̂_i)/f̂_i)², where f̂_i is the reconstructed label of x_i.
Fig. 2. Exemplary thumbnails of the 39 concepts.

We then try to obtain the reconstruction coefficients in this nonlinear space, motivated by the great success of the kernel trick in the pattern recognition area. KLNP also assumes that the label of each sample can be reconstructed linearly from its neighbors' labels, but the label reconstruction in KLNP is more reasonable, as the kernel mapping is nonlinear and better reconstruction coefficients can be obtained in the kernel-mapped space.

Let X = {x_1, …, x_n} be a set of samples (i.e., video shots for our application) in the feature space R^d, where each sample x_i has d-dimensional features. The set contains the first l samples labeled as y_i ∈ {0, 1} for every concept, and the remaining ones are unlabeled. The label "1" represents that the sample is relevant for a certain concept, while "0" represents that it is irrelevant. It is well known that directly optimizing the integer 1/0 labels is an NP-hard problem, so these labels are usually relaxed to continuous values between 0 and 1, which can be seen as the relevance score of each sample for a certain concept. According to the task of high-level feature extraction in TRECVID [1], the objective here is to rank the remaining unlabeled samples, so real-valued scores are naturally suitable for this task. The vector of predicted labels of all samples is represented as f, which can be split into two blocks after the l-th row:

f = [f_l^T  f_u^T]^T.    (5)

Consider a kernel mapping φ operating from the input space to a mapped space H. The kernel matrix K of dot products of the mapped data can be represented as

K_ij = ⟨φ(x_i), φ(x_j)⟩.    (6)

For the kernel function, several typical kernel functions, such as the radial basis function (RBF), Gaussian and sigmoid kernels, can be employed. In our experiments, we adopt the RBF kernel.

We find the k nearest neighbors of every φ(x_i) using the following distance:

d(φ(x_i), φ(x_j)) = sqrt(K_ii − 2 K_ij + K_jj).    (7)

Note that this formula shows the distance can be obtained directly from the kernel matrix instead of the mapping function φ. According to the assumption of LLE [10], φ(x_i) can be linearly reconstructed from its neighbors. Using kernel mapping, the coefficients obtained in the mapped space can reconstruct the labels better than coefficients obtained in the original feature space. To compute the optimal reconstruction coefficients, the reconstruction error of φ(x_i) is defined as

ε_i = ‖φ(x_i) − Σ_{j: x_j ∈ N(x_i)} w_ij φ(x_j)‖²,    (8)

where w_ij is the reconstruction coefficient for φ(x_i) from φ(x_j), and w_i is the vector of "local" reconstruction coefficients. Enforcing the constraint Σ_j w_ij = 1, the reconstruction error can be rewritten as

ε_i = Σ_{j,k} w_ij G^(i)_{jk} w_ik,    (9)

where G^(i)_{jk} is an entry of the Gram matrix of φ(x_i) in the kernel-mapped space:

G^(i)_{jk} = (φ(x_i) − φ(x_j)) · (φ(x_i) − φ(x_k)) = K_ii − K_ij − K_ik + K_jk.    (10)

Please note that here the subscripts j and k index the neighbors of the i-th sample: the position of G^(i)_{jk} in the matrix is decided by the positions of φ(x_j) and φ(x_k) among the k nearest neighbors of φ(x_i). To assure the convergence of iterative label propagation, another constraint is added when reconstructing the propagation coefficients: each reconstruction coefficient should be non-negative. Then we can obtain the optimal reconstruction coefficients for φ(x_i) by solving the standard quadratic programming problem

min_{w_i}  w_i^T G^(i) w_i    s.t.  Σ_j w_ij = 1,  w_ij ≥ 0.    (11)

Similar to the distance measure (7), the Gram matrix in (10) can also be calculated directly from the kernel matrix instead of the mapping function. Therefore, in the entire computing procedure, the mapping function is never explicitly required. Intuitively, the obtained reconstruction coefficients reflect the intrinsic local semantic structure of the samples in the mapped space. These coefficients are applied to reconstruct the unlabeled samples' labels (which are real values instead of "0" or "1"), that is, to estimate the prediction function f. In order to obtain the optimal f, we define the following cost function:

Q(f) = Σ_{i=1}^n ( f_i − Σ_{j: x_j ∈ N(x_i)} w_ij f_j )²,  with f_i = y_i for 1 ≤ i ≤ l,    (12)

where y_i is the label of sample x_i. Minimizing this cost optimally reconstructs the labels of all unlabeled samples from those of their neighbors. From the view of label propagation, minimizing this cost results in iterative propagation of label information from the labeled samples to the others according to the linear neighborhood structure in the nonlinear mapped space. Define a sparse matrix W with the entry in the i-th row and j-th column

W_ij = w_ij if x_j ∈ N(x_i), and 0 otherwise;    (13)

then this optimization objective is represented formally as

min_f ‖ f − W f ‖²    s.t.  f_i = y_i, 1 ≤ i ≤ l.    (14)

This is a standard graph-based semi-supervised learning problem [17]. Splitting the matrix W after the l-th row and column,

W = [ W_ll  W_lu ; W_ul  W_uu ],    (15)

we can represent the optimization problem in matrix form as

min_{f_u} ( f − W f )^T ( f − W f )  with  f_l = y_l.    (16)

Similar to [17], the optimal solution can be obtained as

f_u = ( I − W_uu )^{−1} W_ul y_l.    (17)

Since the rows of W are non-negative and sum to one, as discussed in [5], the spectral radii of W_uu and W_uu^T are both less than 1, so the label prediction can also be accomplished in an iterative manner, that is, iterate

f_u(t+1) = W_uu f_u(t) + W_ul y_l    (18)

until convergence. This is actually an information propagation process from the label of each sample to its neighbors. The main procedure of the above algorithm is summarized in Algorithm 1.

Algorithm 1 KLNP
1: Using the RBF as the kernel, compute the kernel matrix K with respect to X (entries for pairs of samples far away from each other are set to zero, so K is a sparse matrix);
2: Find the k nearest neighbors of each sample in X using the distance measure in (7); store the neighbors' indices in a matrix whose i-th row holds the indices of the i-th sample's neighbors, and also store their distances in a matrix;
3: Compute the Gram matrices according to (10); then the coefficients w_i can be computed by solving the optimization problem in (11);
4: Construct the label propagation coefficients matrix W according to (13);
5: Iterate (18) until convergence; the unlabeled samples' real-valued labels are then obtained;
6: Rank the unlabeled samples according to the predicted labels.

IV. DISCUSSION AND EXTENSION

A multimedia database is always expanding quickly, so how to handle incremental samples is an important topic in multimedia research [18]. Here we discuss this issue for KLNP. Assume that we have incremental samples (labeled and unlabeled) and add them into X. For every new sample, search its k nearest samples and their distances, then calculate its reconstruction coefficient vector by solving the optimization problem (11). Meanwhile, for every existing sample, if the distance to a new sample is smaller than the largest stored neighbor distance, replace the corresponding neighbor with the new sample and update the reconstruction coefficient vector. After the label propagation matrix is updated, the new labels can be obtained by iterating (18) until convergence.

It has been shown in [19] that many kernel algorithms are equivalent to conducting the corresponding linear algorithm in the feature space transformed by kernel principal component analysis (KPCA) [20]. KLNP also has this property, that is, KLNP can be regarded as, and achieved by, KPCA followed by LNP. Next is an analysis of this issue. Suppose the t-th nonzero eigenvalue computed in KPCA is λ_t with corresponding eigenvector v_t; then the t-th projection of φ(x_i) is p_t(x_i) = v_t · φ(x_i), and the projections of φ(x_i) onto all computed KPCA directions can be represented as P(x_i) = [p_1(x_i), …, p_m(x_i)]^T. After the projection, the reconstruction error in LNP can be represented in the KPCA-transformed feature space as

ε'_i = ‖ P(x_i) − Σ_{j: x_j ∈ N(x_i)} w_ij P(x_j) ‖².    (19)

Since the eigenvectors are orthogonal to each other, and each eigenvector is required to be normalized [20], we can obtain

ε'_i = Σ_{j,k} w_ij G^(i)_{jk} w_ik + const.    (20)

There is only a difference of a constant between (20) and (9), so they are the same as a minimization problem. Now we can see that KLNP is equal to KPCA plus LNP.

TABLE I. Comparisons of results by using the feature color moments.
TABLE II. Comparisons of the MAPs by using the feature color moments.
TABLE III. Comparisons of results by using the feature edge distribution layout.
TABLE IV. Comparisons of the MAPs by using the feature edge distribution layout.

V. EXPERIMENTS

To evaluate the proposed KLNP for video annotation, we conduct experiments on the benchmark video corpus of TRECVID 2005, which consists of about 170 hours of TV news videos from 13 different programs in English, Arabic and Chinese [1]. We use the development (DEV) set of TRECVID05 in our experiments. After automatic shot boundary detection, the DEV set contains 43 907 shots. Some shots are further segmented into subshots, giving 61 901 subshots for the DEV set. These data are divided into two parts, with 81% (50 000 subshots) as the training set and 19% (11 901 subshots) as the test set. For each subshot, 39 concepts are labeled as positive ("1") or negative ("0") according to the LSCOM-Lite annotations [21]. These annotated concepts cover a wide range of genres, including program category, setting/scene/site, people, object, activity, event, and graphics. Fig. 2 shows the exemplary thumbnails of these concepts. We adopt k-NN here to find the neighboring points (k is set to 30 empirically).
The process of searching neighbors and calculating coefficients is relatively time-consuming: it needs about 15 hours (with an Intel P4 3.0 GHz CPU and 2 GB of memory). However, this process is also required by other graph-based semi-supervised learning methods such as LNP and GRF, and fortunately it can be calculated offline. The parameters in these methods are all tuned to be nearly optimal through five-fold cross validation, while α is empirically set to 0.99 in the consistency method. Once the label propagation coefficients are obtained, the time costs of KLNP, the GRF method and the consistency method are all about one minute, and LNP costs about 20 s. Although KLNP is a little slower than LNP, it is computationally effective, as processing time is always a challenging problem in TRECVID tasks.

For each concept, systems are required to return ranked lists of up to 2000 subshots, and system performance is measured via the official performance metric Average Precision (AP) [22] of the TRECVID tasks. The AP corresponds to the area under a non-interpolated recall/precision curve, and it favors highly ranked relevant subshots. We average the APs over all 39 concepts to create the Mean Average Precision (MAP), which is the overall evaluation result.

The low-level features used here are:
• 225-D block-wise color moments in LAB color space, extracted over 5 × 5 fixed grid partitions, each block described by a 9-D feature;
• 75-D edge distribution layout;
• 144-D color correlogram in HSV color space.

To avoid the curse of dimensionality, we conduct the experiments on the three feature sets separately, and then the three result scores of each sample for a certain concept are combined into a fused score using linear fusion through cross validation. For performance evaluation, we compare our algorithm with LNP [13] and two other popular semi-supervised learning
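The non-interpolated AP described above can be sketched as follows. This is a common formulation of the metric; the official TRECVID evaluation tool may differ in details such as truncation at 2000 returned subshots:

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Non-interpolated AP over a ranked list of 0/1 relevance judgments.

    ranked_relevance[r] is 1 if the item at rank r+1 is relevant.
    total_relevant, if given, is the number of relevant items in the
    whole collection (the usual denominator); otherwise the hits in
    the list are used.
    """
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant hit
    denom = total_relevant if total_relevant is not None else hits
    return precision_sum / denom if denom else 0.0
```

Averaging this value over the 39 concepts gives the MAP used as the overall score in the comparisons that follow.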
4, JUNE 2008TABLE V COMPARISONS OF RESULTS BY USING THE FEATURE COLOR CORRELOGRAMTABLE VII COMPARISONS OF THE FUSED RESULTSTABLE VI COMPARISONS OF THE MAPS BY USING THE FEATURE COLOR CORRELOGRAMmethods: Gaussian random field [17] and consistency method [9]. The experimental results are shown in Table I (by using the feature color moments), Table III (by using the feature edge distribution layout), Table V (by using the feature color correlogram), and Table VII (for fused results) respectively.Comparing these results, we can see that KLNP performs the best on 27 of the all 39 concepts using the color moment, performs the best on 26 of the all 39 concepts using the edge distribution layout and performs the best on 23 of the all 39 concepts using the color correlogram. After linear fusion, KLNP performs the best on 27 of the all 39 concepts. The comparisons of MAPs between the KLNP, LNP, GRF, and the consistency method are shown in Table II, IV, VI and VIII, respectively. The MAP of KLNP is 0.27355 by using the feature color moment, which has an improvement of 3.14%, 8.86%, and 50.52% over LNP, GRF, and the consistency method, respectively. It has a MAP of 0.20604 by using the feature edge distribution layout, which respectively has an improvement of 5.51%, 8.35%, and 135.42% over LNP, GRFTANG et al.: VIDEO ANNOTATION BASED ON KLNP627TABLE VIII COMPARISONS OF THE MAPS OF FUSED RESULTSand consistency method. And it obtain a MAP of 0.24859 by using the feature color correlogram, which has an improvement of 6.17%, 4.97%, and 60.25% over LNP, GRF, and the consistency method, respectively. Through linear fusion, KLNP obtains a MAP of 0.30397, which has an improvement of 4.85% over LNP, 5.80% over GRF and 60.06% over the consistency method. All these comparisons demonstrate that KLNP is more appropriate than other semi-supervised learning methods and is effective for semantic video annotation. 
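The AP/MAP evaluation used above can be reproduced with a short routine. The sketch below computes non-interpolated AP from a ranked list of binary relevance labels and averages the per-concept APs into a MAP; the function names are illustrative, and any truncation of the ranked list (TRECVID allows up to 2000 returned subshots) is assumed to happen before calling it:

```python
def average_precision(ranked_labels, total_relevant=None):
    """Non-interpolated AP of a ranked list of binary relevance labels (1 = relevant)."""
    if total_relevant is None:
        total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant position
    return precision_sum / total_relevant

def mean_average_precision(rankings_per_concept):
    """MAP: mean of the per-concept APs (39 concepts in the setting above)."""
    aps = [average_precision(r) for r in rankings_per_concept]
    return sum(aps) / len(aps)
```

For example, a ranked list with relevant items at positions 1 and 3 gives AP = (1/1 + 2/3)/2 ≈ 0.833, which shows why the metric favors highly ranked relevant subshots.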
One thing should be mentioned: the performance of the consistency method here is not as good as in many other applications [9]. The main reason is that we require the propagation matrix to be sparse for this large-scale application, and this requirement destroys the matrix symmetry needed by the consistency method.

VI. CONCLUSION

We have analyzed the linear limitation of local semantics for LNP when ranking data with complex distributions, and proposed an improved method named KLNP, in which a nonlinear kernel-mapped space is introduced to optimize the coefficient reconstruction. The experiments conducted on the TRECVID dataset demonstrate that the proposed method is more appropriate for data with complex distributions and is effective for the semantic video annotation task.

REFERENCES

[1] Guidelines for the TRECVID 2005 Evaluation [Online]. Available: http://wwwnlpir. /projects/tv2005/tv2005.html
[2] O. Chapelle, A. Zien, and B. Schölkopf, Semi-Supervised Learning. Cambridge, MA: MIT Press, 2006.
[3] Y. Song, X.-S. Hua, L. Dai, and M. Wang, "Semi-automatic video annotation based on active learning with multiple complementary predictors," in ACM Int. Workshop on Multimedia Information Retrieval, Singapore, Nov. 2005, pp. 97–104.
[4] R. Yan and M. Naphade, "Semi-supervised cross feature learning for semantic concept detection in videos," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Diego, CA, Jul. 2005, pp. 657–663.
[5] J. Tang, X.-S. Hua, G.-J. Qi, M. Wang, T. Mei, and X. Wu, "Structure-sensitive manifold ranking for video concept detection," in Proc. ACM Multimedia, Augsburg, Germany, Sep. 2007, pp. 852–861.
[6] X. Yuan, X.-S. Hua, M. Wang, and X. Wu, "Manifold-ranking based video concept detection on large database and feature pool," in Proc. ACM Multimedia, Santa Barbara, CA, Oct. 2006, pp. 623–626.
[7] C. Wang, F. Jing, L. Zhang, and H.-J. Zhang, "Image annotation refinement using random walk with restarts," in Proc. ACM Multimedia, Santa Barbara, CA, Oct. 2006, pp. 647–650.
[8] J. Tang, X.-S. Hua, T. Mei, G.-J. Qi, and X. Wu, "Video annotation based on temporally consistent Gaussian random field," Electron. Lett., vol. 43, no. 8, 2007.
[9] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul, and B. Schölkopf, Eds. Cambridge, MA: MIT Press, 2004, pp. 321–328.
[10] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323–2326, 2000.
[11] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA: MIT Press, 2002, pp. 585–591.
[12] L. K. Saul and S. T. Roweis, "Think globally, fit locally: Unsupervised learning of low dimensional manifolds," J. Mach. Learning Res., pp. 119–155, 2003.
[13] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," in Proc. 23rd Int. Conf. Machine Learning, Jun. 2006, pp. 985–992.
[14] F. Wang, J. Wang, C. Zhang, and H. C. Shen, "Semi-supervised classification using linear neighborhood propagation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, New York, Jun. 2006, pp. 160–167.
[15] B. Schölkopf, "The kernel trick for distances," in Advances in Neural Information Processing Systems, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds. Cambridge, MA: MIT Press, 2001, pp. 301–307.
[16] J. Tang, X.-S. Hua, Y. Song, G.-J. Qi, and X. Wu, "Kernel-based linear neighborhood propagation for semantic video annotation," in Proc. 11th Pacific-Asia Conf. Knowledge Discovery and Data Mining, Nanjing, China, Jun. 2007, pp. 793–800.
[17] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. 20th Int. Conf. Machine Learning, Washington, DC, Aug. 2003, pp. 912–919.
[18] D. Tao, X. Tang, X. Li, and Y. Rui, "Direct kernel biased discriminant analysis: A new content-based image retrieval relevance feedback algorithm," IEEE Trans. Multimedia, vol. 8, no. 4, pp. 716–727, Aug. 2006.
[19] D. Tao, X. Li, and S. Maybank, "Negative samples analysis in relevance feedback," IEEE Trans. Knowl. Data Eng., vol. 19, no. 4, pp. 568–580, Apr. 2007.
[20] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, no. 5, pp. 1299–1319, 1998.
[21] M. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, "Large-scale concept ontology for multimedia," IEEE Multimedia Mag., vol. 16, no. 3, pp. 86–91, Jul.–Sep. 2006.
[22] TREC-10 Proceedings Appendix on Common Evaluation Measures [Online]. Available: /pubs/trec10/appendices/measures.pdf

Jinhui Tang (S'04) received the B.S. degree in 2003 from the University of Science and Technology of China, Hefei, where he is currently pursuing the Ph.D. degree in the Department of Electronic Engineering and Information Science. From June 2006 to February 2007, he was a research intern with the Internet Media Group at Microsoft Research Asia. From February 2008 to May 2008, he was a research intern with the School of Computing, National University of Singapore. His current research interests include content-based video analysis, pattern recognition, and image retrieval. Mr. Tang is a student member of the Association for Computing Machinery.

Xian-Sheng Hua (M'05) received the B.S. and Ph.D. degrees from Beijing University, Beijing, China, in 1996 and 2001, respectively, both in applied mathematics. Since 2001, he has been with Microsoft Research Asia, where he is currently a Lead Researcher in the Internet Media Group. His research interests are in the areas of video search, content-based video analysis, pattern recognition, and machine learning. He has authored more than 100 publications in these areas and has 20 patents or pending applications. Dr. Hua is an Associate Editor of IEEE TRANSACTIONS ON MULTIMEDIA and has been on the Editorial Board of the Journal of Multimedia Tools and Applications since 2007. He is a member of the Association for Computing Machinery.