A MULTIMODAL AND MULTILEVEL RANKING FRAMEWORK FOR CONTENT-BASED VIDEO RETRIEVAL
Metrics for the Multimodal Fusion Ability of Large Models

The ability to fuse multiple modalities in large-scale models is an important requirement in fields such as computer vision, natural language processing, and multimodal learning. It refers to a model's capability to effectively integrate information from different modalities, such as text, image, audio, and video, in order to make accurate predictions or generate meaningful outputs. Evaluating this ability requires defining and analyzing suitable metrics.

One commonly used metric is accuracy or performance on a specific task. For example, in an image captioning task, the model's ability to generate accurate and descriptive captions by effectively combining visual and textual information can be evaluated using metrics like BLEU (Bilingual Evaluation Understudy) or CIDEr (Consensus-based Image Description Evaluation). These metrics compare the generated captions with human-written reference captions and assess the quality of the generated outputs.

Another perspective is to analyze the model's attention mechanisms. Attention mechanisms allow models to focus on specific parts or modalities of the input during the fusion process. By examining the attention weights assigned to different modalities, we can gain insight into how well the model integrates and balances the information from each modality. For instance, in a visual question answering task, the attention weights can reveal whether the model is relying primarily on visual or textual cues to generate the answer.

Furthermore, the interpretability of the model's fusion process is an important aspect to consider: a good multimodal fusion model should provide explanations or justifications for its predictions.
This can be achieved through techniques such as attention visualization or saliency mapping, which highlight the important regions or modalities that contribute to the model's decision-making process. By providing interpretable fusion mechanisms, models can enhance transparency and trustworthiness, making them more suitable for real-world applications.

In addition to accuracy, attention mechanisms, and interpretability, the efficiency of multimodal fusion is also crucial. Large-scale models often require significant computational resources, and efficient fusion techniques can help reduce the computational burden. Techniques like low-rank approximation, sparse coding, or knowledge distillation can be applied to compress or simplify the fusion process without significantly sacrificing performance. Evaluating the computational efficiency of multimodal fusion techniques ensures that models can be deployed in resource-constrained environments without compromising their overall performance.

Lastly, the generalizability and transferability of multimodal fusion models should be considered. Models that effectively fuse information from multiple modalities in one domain may not necessarily perform well in another. It is therefore important to evaluate a model's ability to generalize and transfer its fusion capabilities across different tasks or datasets. Techniques such as domain adaptation or transfer learning can be employed to enhance the model's ability to adapt and generalize to new multimodal data.

In conclusion, the evaluation of large-scale multimodal fusion models requires a comprehensive analysis from various perspectives. Metrics covering task performance, attention mechanisms, interpretability, efficiency, and generalizability should all be considered to assess a model's fusion ability accurately.
By evaluating models against these criteria, researchers and practitioners can gain insight into the strengths and limitations of different multimodal fusion techniques, leading to the development of more robust and effective models in the future.
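To make the caption-metric idea mentioned above concrete, here is a deliberately simplified, unigram-only BLEU-style score in plain Python. Real BLEU combines clipped n-gram precisions up to 4-grams in a geometric mean; this sketch only illustrates the two core ingredients, clipped unigram precision and the brevity penalty, and the function name is our own:

```python
import math
from collections import Counter

def unigram_bleu(candidate, reference):
    """Clipped unigram precision times the brevity penalty. Real BLEU combines
    up to 4-gram precisions in a geometric mean; this is only a sketch."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    clipped = Counter(cand) & Counter(ref)       # clip counts by the reference
    precision = sum(clipped.values()) / len(cand)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * precision
```

A candidate identical to the reference scores 1.0, a candidate sharing no words scores 0.0, and a too-short candidate is penalized by the brevity term.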
Mathematical Modelling and Numerical Analysis (Modélisation Mathématique et Analyse Numérique). Will be set by the publisher.

© EDP Sciences, SMAI 1999
PAVEL BELÍK AND MITCHELL LUSKIN
In general, the analysis of stability is more difficult for transformations with N = 4, such as the tetragonal to monoclinic transformations studied in this paper, and N = 6, since the additional wells give the crystal more freedom to deform without the cost of additional energy. In fact, we show here that there are special lattice constants for which the simply laminated microstructure for the tetragonal to monoclinic transformation is not stable. The stability theory can also be used to analyze laminates with varying volume fraction [24] and conforming and nonconforming finite element approximations [25, 27]. We also note that the stability theory was used to analyze the microstructure in ferromagnetic crystals [29]. Related results on the numerical analysis of nonconvex variational problems can be found, for example, in [7-12, 14-16, 18, 19, 22, 26, 30-33]. We give an analysis in this paper of the stability of a laminated microstructure with infinitesimal length scale that oscillates between two compatible variants. We show that for any other deformation satisfying the same boundary conditions as the laminate, we can bound the perturbation of the volume fractions of the variants by the perturbation of the bulk energy. This implies that the volume fractions of the variants for a deformation are close to the volume fractions of the laminate if the bulk energy of the deformation is close to the bulk energy of the laminate. This concept of stability can be applied directly to obtain results on the convergence of finite element approximations, and it guarantees that any finite element solution with sufficiently small bulk energy gives reliable approximations of the stable quantities such as volume fraction. In Section 2, we describe the geometrically nonlinear theory of martensite. We refer the reader to [2, 3] and to the introductory article [28] for a more detailed discussion of the geometrically nonlinear theory of martensite.
We review the results given in [34, 35] on the transformation strains and possible interfaces for tetragonal to monoclinic transformations corresponding to the shearing of the square and rectangular faces, and we then give the transformation strain and possible interfaces corresponding to the shearing of the plane orthogonal to a diagonal in the square base. In Section 3, we give the main results of this paper, which give bounds on the volume fraction of the crystal in which the deformation gradient is in energy wells that are not used in the laminate. These estimates are used in Section 4 to establish a series of error bounds in terms of the elastic energy of deformations: for the L2 approximation of the directional derivative of the limiting macroscopic deformation in any direction tangential to the parallel layers of the laminate, for the L2 approximation of the limiting macroscopic deformation, for the approximation of volume fractions of the participating martensitic variants, and for the approximation of nonlinear integrals of deformation gradients. Finally, in Section 5 we give an application of the stability theory to the finite element approximation of the simply laminated microstructure.
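Schematically, the stability estimate described above takes the following form. This is an expository sketch only: the symbols, the constant, and the exponent are illustrative assumptions, not the paper's precise statement, which carries its own norms and hypotheses.

```latex
% theta   = volume fraction of a variant in the laminate,
% theta_v = corresponding volume fraction for a deformation v satisfying
%           the same boundary conditions,
% E(v)    = excess bulk energy of v over the laminate.
% C and the exponent 1/2 are illustrative assumptions.
\[
  \lvert \theta_v - \theta \rvert \;\le\; C \, \mathcal{E}(v)^{1/2} .
\]
```

Read this way, a deformation whose bulk energy is close to that of the laminate must also have volume fractions close to those of the laminate, which is exactly the property used to certify finite element approximations.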
Multilevel (Multi-Layer) Logistic Regression Models

English answer: Logistic regression is a popular statistical model for binary classification tasks. It is a type of generalized linear model that uses a logistic function to model the probability of a certain event occurring. The model is trained on a dataset of labeled examples, where each example consists of a set of input features and a corresponding binary label.

A multi-layer logistic regression model consists of multiple layers, each containing a set of weights and biases. These weights and biases are learned during training, where the model adjusts them to minimize the difference between the predicted probabilities and the true labels. The layers can be thought of as a hierarchy of features, where each layer learns to represent more complex and abstract features based on the inputs from the previous layer.

In the context of deep learning, logistic regression can be extended with multiple hidden layers, resulting in a multi-layer logistic regression model. Each hidden layer introduces additional non-linear transformations of the input features, allowing the model to learn more complex representations. This makes the model more powerful and capable of capturing intricate patterns in the data.

To train a multi-layer logistic regression model, we typically use a technique called backpropagation. This involves computing the gradient of the loss function with respect to the model parameters and updating the parameters using gradient descent. The backpropagation algorithm calculates these gradients efficiently by propagating the errors from the output layer back to the input layer.

Multi-layer logistic regression models have been successfully applied in various domains, such as image classification, natural language processing, and speech recognition.
For example, in image classification, a multi-layer logistic regression model can learn to recognize different objects in images by extracting hierarchical features from the pixel values.

Chinese answer (translated): The multilevel logistic regression model is a commonly used statistical model for binary classification tasks.
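The multi-layer model and backpropagation described above can be sketched in a few lines of NumPy. This is a minimal illustration only: one hidden layer, the XOR toy dataset, plain gradient descent, and arbitrary layer sizes and learning rate, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer plus a logistic output, trained by backpropagation on XOR.
rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])          # XOR needs the hidden layer

W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 1.0, (8, 1)); b2 = np.zeros(1)
lr = 1.0
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                    # hidden features
    p = sigmoid(h @ W2 + b2)                    # predicted probabilities
    dz2 = (p - y) / len(X)                      # grad of mean cross-entropy
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dh = (dz2 @ W2.T) * h * (1.0 - h)           # backpropagate through sigmoid
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

preds = (p > 0.5).astype(int).ravel()
```

Without the hidden layer, plain logistic regression cannot fit XOR at all, which is exactly the point made above about non-linear transformations.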
Quantization Methods for Large Language Models (LLMs)

LLM quantization is a technique for large language models that uses deep learning methods to greatly improve computational efficiency when processing natural language. Quantization has already been widely applied in machine translation, speech recognition, natural language processing, and other fields, with notable results.

Quantization means mapping a continuous numeric space onto a discrete one. In deep learning, the number of parameters is enormous; with traditional full-precision arithmetic the computational cost is very high, leading to run times too long to meet real-time requirements. Effectively reducing this computational cost is therefore a key technique, and LLM quantization is one efficient way to do so. The basic idea is to discretize the numeric ranges of the network's weights and activations in a hardware-friendly way, converting real-valued computation into integer computation, thereby reducing computational complexity and improving efficiency.
Concretely, LLM quantization can be divided into three steps. The first step is to quantize the network's parameters: the real-valued weights are discretized into integer values, which effectively reduces the network's storage footprint and speeds up computation. The second step is to quantize the network's outputs: the activations are discretized, converting real-valued outputs into integer-valued ones, which greatly reduces computational complexity and increases speed while largely preserving the original network's output accuracy. The third step is to train the quantized network: by training the discretized embedding layers, the network's resolution and accuracy can be improved so that it better fits the task, and the quantization parameters themselves can also be trained to improve classification accuracy.
In summary, LLM quantization is an efficient deep learning technique: it greatly reduces computational complexity and improves speed while largely preserving the original network's output accuracy. It therefore has broad application prospects in deep learning, and it stands to bring more convenience and innovation to everyday applications.
Mock AI Interview Questions and Answers (in English)

1. Question: What is the difference between a neural network and a deep learning model?
Answer: A neural network is a set of algorithms, modeled loosely after the human brain, designed to recognize patterns. A deep learning model is a neural network with multiple layers, allowing it to learn more complex patterns and features from data.

2. Question: Explain the concept of "overfitting" in machine learning.
Answer: Overfitting occurs when a machine learning model learns the training data too well, including its noise and outliers, resulting in poor generalization to new, unseen data.

3. Question: What is the role of "bias" in an AI model?
Answer: Bias in an AI model refers to systematic errors introduced by the model during the learning process. It can be due to the choice of model, the training data, or the algorithm's assumptions, and it can lead to unfair or inaccurate predictions.

4. Question: Describe the importance of data preprocessing in AI.
Answer: Data preprocessing is crucial in AI, as it involves cleaning, transforming, and reducing the data to a format suitable for the model to learn from effectively. Proper preprocessing can significantly improve the performance of AI models by ensuring that the input data is relevant, accurate, and free from noise.

5. Question: How does reinforcement learning differ from supervised learning?
Answer: Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a reward signal. It differs from supervised learning, where the model learns from labeled data to predict outcomes based on input features.

6. Question: What is the purpose of a convolutional neural network (CNN)?
Answer: A convolutional neural network (CNN) is a type of deep learning model that is particularly effective for processing data with a grid-like topology, such as images. CNNs use convolutional layers to automatically and adaptively learn spatial hierarchies of features from input images.

7. Question: Explain the concept of "feature extraction" in AI.
Answer: Feature extraction in AI is the process of identifying and extracting relevant pieces of information from raw data. It is a crucial step in many machine learning algorithms, as it helps to reduce the dimensionality of the data and to focus on the most informative aspects that can be used to make predictions or classifications.

8. Question: What is the significance of gradient descent in training AI models?
Answer: Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In AI, it is used to minimize the loss function of a model, refining the model's parameters to improve its accuracy.

9. Question: How does transfer learning work in AI?
Answer: Transfer learning is a technique where a pre-trained model is used as the starting point for learning a new task. It leverages the knowledge gained from one problem to improve performance on a different but related problem, reducing the need for large amounts of labeled data and computational resources.

10. Question: What is the role of regularization in preventing overfitting?
Answer: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function, which discourages overly complex models. It helps control the model's capacity, forcing it to generalize better to new data by not fitting too closely to the training data.
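As a small illustration of the gradient descent answer above, here is a sketch minimizing f(x) = (x − 3)², whose gradient is 2(x − 3); the function and step count are our own toy choices:

```python
# Minimize f(x) = (x - 3)^2 by following the negative gradient.
def grad(x):
    return 2.0 * (x - 3.0)

x = 0.0
lr = 0.1                 # learning rate
for _ in range(100):
    x -= lr * grad(x)    # step in the direction of steepest descent
```

Each step shrinks the distance to the minimizer x = 3 by a constant factor (1 − 2·lr), so after 100 steps x is essentially at the minimum.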
A High Robustness and Low Cost Model for Cascading Failures

arXiv:0704.0345v1 [physics.soc-ph] 3 Apr 2007. epl draft.

Network robustness has been one of the most central topics in complex network research [1]. In scale-free networks, the existence of hub vertices with high degrees has been shown to yield fragility to intentional attacks, while at the same time the network becomes robust to random failures due to the heterogeneous degree distribution [2-5]. On the other hand, for the description of dynamic processes on top of networks, it has been suggested that the information flow across the network is one of the key issues, which can be captured well by the betweenness centrality or the load [6]. Cascading failures can happen in many infrastructure networks, including the electrical power grid, the Internet, road systems, and so on. At each vertex of the power grid, electric power is either produced or transferred to other vertices, and it is possible that for some reason a vertex is overloaded beyond its given capacity, which is the maximum electric power the vertex can handle. The breakdown of a heavily loaded single vertex will cause the redistribution of loads over the remaining vertices, which can trigger breakdowns of newly overloaded vertices. This process goes on until all the loads of the remaining vertices are below their capacities. For some real networks, the breakdown of a single vertex is sufficient to collapse the entire system, which is exactly what happened on August 14, 2003, when an initial minor disturbance in Ohio triggered the largest blackout in the history of the United States, in which millions of people suffered without electricity for as long as 15 hours [7]. A number of aspects of cascading failures in complex networks have been discussed in the literature [8-16], including the model for describing cascade phenomena [8], the control and defense strategy against cascading failures [9, 10], the analytical calculation of the capacity parameter [11], and the modelling of real-world data [12]. In a
recent paper [16], the cascade process in scale-free networks with community structure has been investigated, and it has been found that a smaller modularity makes it easier to trigger a cascade, which implies the importance of modularity and community structure in cascading failures. In the research on cascading failures, the following two issues are closely related to each other and of significant interest: one is how to improve the network robustness to cascading failures, and the other, particularly important, issue is how to design man-made networks at a lower cost. In most circumstances, a high robustness and a low cost are difficult to achieve simultaneously. For example, while a network with more edges is more robust to failures, in practice the number of edges is often limited by the cost to construct them. In brief, it costs much to build a robust network. Very recently, Schäfer et al. proposed a new proactive measure to increase the robustness of heterogeneously loaded networks to cascades. By defining load-dependent weights, the network becomes more homogeneous and the total load is decreased, which means the investment cost is also reduced [15]. In the present Letter, for simplicity, we try to find a possible way of protecting networks based on the flow along shortest [...]

g = N'/N,   (3)

which we call the robustness from now on. For networks of homogeneous load distributions, the cascade does not happen and g ≈ 1 has been observed [8]. Also for networks of scale-free load distributions, one can have g ≈ 1 if randomly chosen vertices, instead of vertices with high loads, are destroyed at the initial stage [8]. In general, one can split, at least conceptually, the total cost for the networks into two different types: on the one hand, there is the initial construction cost to build the network structure, which may include, e.g., the cost of the power transmission lines in power grids, and a cost proportional to the length of road in road networks.
Another type of cost is required to make the given network function, which can be an increasing function of the amount of flow and can be named the running cost. For example, we need to spend more on bigger memory sizes, faster network cards, and so on for a computer server that delivers more data packets. In the present Letter, we assume that the network structure is given (accordingly, the construction cost is fixed) and focus only on the running cost, which must be spent in addition to the initial construction cost.

Without consideration of the cost to protect vertices, cascading failures could be made never to happen by assigning extremely high values to the capacities. However, in practice, the capacity is severely limited by cost. We expect the cost to protect the vertex v to be an increasing function of c_v, and for convenience define the cost e as

e = (1/N) Σ_{v=1}^{N} [λ(l_v) − 1].   (4)

It is to be noted that for a given value of α, the original Motter-Lai (ML) capacity model in Ref. [8] always has a higher value of the cost than our model (see Fig. 1). Although e = 0 at β = 1, this should not be interpreted as a cost-free situation; we have defined e only as a relative measure in comparison to the case of λ(l) = 1 for all vertices. For a given network structure, the key quantities to be measured are g(α, β) and e(α, β), and we aim to increase g and decrease e, which will eventually provide us
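The cascade process described above can be sketched as follows. This is a toy, assumption-heavy simplification: loads are approximated by counting one BFS shortest path per ordered vertex pair rather than the exact betweenness-based load, capacities follow the Motter-Lai rule c_v = (1 + α) l_v, and the Letter's λ(l) weighting is not modeled.

```python
from collections import deque

def loads(adj, alive):
    """Approximate load: for every ordered (source, target) pair, count the
    interior vertices of one BFS shortest path (ties broken by adjacency order)."""
    load = {v: 0.0 for v in alive}
    for s in alive:
        parent, dist = {s: None}, {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w in alive and w not in dist:
                    dist[w] = dist[u] + 1
                    parent[w] = u
                    q.append(w)
        for t in dist:
            v = parent[t]
            while v is not None and v != s:
                load[v] += 1
                v = parent[v]
    return load

def cascade(adj, alpha=0.5):
    """Motter-Lai-style cascade: capacities c_v = (1 + alpha) * l_v, remove the
    most loaded vertex, then iteratively drop every overloaded vertex."""
    l0 = loads(adj, set(adj))
    cap = {v: (1 + alpha) * l0[v] for v in adj}
    alive = set(adj) - {max(l0, key=l0.get)}
    while True:
        l = loads(adj, alive)
        over = {v for v in alive if l[v] > cap[v]}
        if not over:
            return alive          # surviving vertices; g = len(alive) / N
        alive -= over
```

On a star graph, for instance, the initial attack removes the hub; the remaining leaves carry no load, so the cascade stops immediately with g = (N − 1)/N.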
Distilling the Knowledge in a Neural Network

Knowledge distillation is a method of model compression: a larger, more complex teacher model guides the training of a lighter student model, so that model size and computational cost are reduced while the teacher's accuracy is preserved as far as possible. The approach became widely known mainly through Hinton's paper "Distilling the Knowledge in a Neural Network".

Knowledge distillation is also a simple way to make up for the weak supervision signal in classification problems. In a traditional classification problem, the goal of the model is to map the input features to a point in the output space; in the famous ImageNet competition, for example, every possible input image is mapped to one of 1,000 points in the output space, and each of those 1,000 points is a one-hot-encoded category. Such a label provides only on the order of log(number of classes) bits of supervision. In knowledge distillation, however, the teacher model outputs a continuous label distribution for each sample, so the supervision carries far more information than a one-hot label. From another perspective, if only a hard label is available, the model is forced to map every training sample of a class to the same point, losing the intra-class variance and inter-class distances that are very helpful for training; the teacher model's outputs can recover this information. The paper's concrete example is that a cat is closer to a dog than to a table, and if an animal does look somewhat like both a cat and a dog, the soft label can provide supervision for both categories. To sum up, the core idea of knowledge distillation is to replace supervision that has been compressed to a single point with a spread-out distribution, so that the student model's output distribution matches the teacher model's as closely as possible.
In fact, to achieve this goal, a teacher model is not strictly necessary: uncertainty information retained during data annotation or collection can also help train the model.
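The soft-label idea can be sketched as a temperature-scaled cross-entropy between the teacher's and the student's distributions. This is a minimal version of the soft half of the distillation loss; the T² factor follows the convention in Hinton's paper, and the function names and default temperature are our own illustrative choices:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T yields a softer distribution."""
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the teacher's soft targets and the student's
    softened prediction, scaled by T^2 (the 'soft' half of the KD loss)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -T * T * sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
```

The loss is minimized when the student reproduces the teacher's distribution exactly, which is precisely the "match the spread-out supervision" objective described above; in practice it is combined with the ordinary hard-label cross-entropy.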
Reference Translation of the Text (1) — English for Information Science and Electronic Engineering (2nd ed.), Wu Yating, Tsinghua University Press

Unit 16: Big Data and Cloud Computing
Unit 16-1, Part 1: Big Data

The world is currently experiencing an explosion of data. Industry analysts and enterprises regard big data as the next big thing, a new way to provide opportunities, insights, and solutions and to increase business profits. From social networking sites to hospital records, big data plays an important role in improving businesses and driving innovation.

The term "big data" refers to data sets so large or complex that traditional data processing applications are inadequate to handle them: the information comes from multiple heterogeneous, autonomous sources with complex and evolving relationships, and it keeps growing. Big data challenges include capturing data, data storage, search, data analysis, sharing, transfer, visualization, querying, updating, and privacy protection.

Data sets are growing rapidly, in part because data are increasingly collected by numerous cheap information-sensing Internet of Things devices, including mobile devices, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. Since the 1980s, the world's per-capita capacity to store technical information has roughly doubled every 40 months; as of 2012, about 2.5 exabytes (2.5×10^18 bytes) of data were generated every day. As the volume of data keeps increasing, data analysis becomes ever more competitive.

There is no doubt that the amount of data now available is indeed large, but that is not the most important characteristic of this new data ecosystem. The challenge we face is not only to collect and manage vast quantities of diverse data, but also to extract real value from them, which involves predictive analytics, user behavior analytics, and other advanced data analysis methods. The value of big data is being recognized by many industries and governments. Analysis of data sets can find new correlations to spot business trends, prevent diseases, combat crime, and so on.

Types of big data. Big data comes from a variety of sources and can be divided into three broad categories: structured, semi-structured, and unstructured.

- Structured data: data that is easy to categorize and analyze, such as numbers and words. This kind of data is mainly generated by network sensors embedded in electronic devices such as smartphones and Global Positioning System (GPS) devices. Structured data also includes transaction data, sales figures, account balances, and so on. Its structure and consistency allow it to answer simple queries, based on an organization's parameters and operational needs, to obtain usable information.
- Semi-structured data: a form of structured data that does not conform to an explicit, fixed schema. The data is self-describing and contains tags or other markers that enforce hierarchies of records and fields within the data.
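A hypothetical illustration of the semi-structured category: JSON records whose field names act as self-describing tags, with no fixed schema shared across records (the device names and fields below are invented for the example):

```python
import json

# Two records with self-describing tags but no common fixed schema.
records = json.loads("""
[
  {"device": "gps-01", "lat": 22.42, "lon": 114.21},
  {"device": "cam-07", "frames": 1200, "codec": "h264"}
]
""")
fields = [sorted(r.keys()) for r in records]   # each record carries its own fields
```

Unlike a relational table, where every row has the same columns, each record here declares its own fields, which is exactly what "self-describing" means in the passage above.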
A MULTIMODAL AND MULTILEVEL RANKING FRAMEWORK FOR CONTENT-BASED VIDEO RETRIEVAL

Steven C. H. Hoi and Michael R. Lyu
Department of Computer Science and Engineering and the Shun Hing Institute of Advanced Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
E-mail: {chhoi,lyu}@.hk

ABSTRACT

One critical task in content-based video retrieval is to rank search results effectively with combinations of multimodal resources. This paper proposes a novel multimodal and multilevel ranking framework for content-based video retrieval. The main idea of our approach is to represent videos by graphs and to learn harmonic ranking functions by smoothly fusing multimodal resources over these graphs. We further tackle the efficiency issue with a multilevel learning scheme, which makes the semi-supervised ranking method practical for large-scale applications. Our empirical evaluations on the TRECVID 2005 dataset show that the proposed multimodal and multilevel ranking framework is effective and promising for content-based video retrieval.

Index Terms — video retrieval, multimodal fusion, multilevel ranking, semi-supervised learning, performance evaluation

1. INTRODUCTION

With the rapid growth of digital devices, Internet infrastructures, and Web technologies, videos nowadays can be easily captured, stored, uploaded, and delivered over the Web.
Although general Web search engines have achieved many successes, searching video content over the Web is still challenging. Most commercial Web search engines usually only index the metadata of videos and search them by text. Without understanding the media content, traditional search engines may be limited in video retrieval. There should be great room for improving traditional search engines for video retrieval by exploiting the rich media content. This makes content-based video retrieval (CBVR) a promising direction for developing future video search engines. In the past decades, content-based image retrieval has been actively studied in the signal processing and multimedia communities [1]. Recently, content-based video retrieval has attracted more and more research attention. Since 2001, the TREC Video Retrieval (TRECVID) evaluation has been set up for benchmark evaluations of video retrieval [2]. In general, a content-based video search engine can be built upon a traditional text-based search engine with extracted video content, such as speech recognition transcripts, closed captions, and video Optical Character Recognition (OCR) text. Although such video search engines benefit from mature text search engine techniques, some characteristics of video data, such as noisy text transcripts, make content-based video search tasks much more difficult than the traditional search tasks over text documents. Therefore, it is clearly not enough to directly apply text-based search engine solutions to video retrieval tasks. In the past several years, some previous research efforts in content-based video retrieval have shown that combining resources from multiple modalities can improve the retrieval performance of traditional text-based approaches on video search tasks [3, 4, 5]. Although promising improvements have been achieved in the past several years, until now it has remained a very challenging task to conduct content-based retrieval on large-scale video databases. Many difficult open
issues are not yet tackled. One of the most challenging and essential issues is how to develop an effective learning scheme for combining resources from multiple modalities to rank search results and balance retrieval performance, while achieving computational efficiency for large-scale applications. To attack this challenge, we propose a multimodal and multilevel learning framework for ranking search results effectively, which not only improves the retrieval performance of traditional approaches, but is also efficient for large-scale video retrieval tasks.

The rest of this paper is organized as follows. Section 2 presents our multimodal and multilevel ranking framework and describes our methodology in detail. Section 3 discusses our experimental evaluations on the TRECVID 2005 dataset. Section 4 sets out our conclusion.

2. MULTIMODAL AND MULTILEVEL RANKING FRAMEWORK

2.1. Overview

In this section we present a multimodal and multilevel learning framework and discuss the engaged methodology. First of all, we describe how to represent videos by graphs. Based on the graph representations, we suggest learning harmonic ranking functions over the graphs by Gaussian fields and harmonic functions. We then discuss how to fuse multiple resources for ranking through the graphs. Finally, we present the architecture of our multimodal and multilevel learning framework, which is able to reach a good tradeoff between retrieval performance and computational efficiency.

2.2. Video Representation and Graph-Based Modeling

Videos contain rich resources from multiple modalities, including text transcripts from speech recognition and low-level visual content. In general, a video clip consists of an audio channel and a visual channel. From the audio channel, text information can be extracted through speech recognition processing. High-level semantic events may also be detected from the audio channel. A video sequence in the visual channel can be regarded as a series of image frames presented in a time
sequence. Typically, such a video sequence can be represented by a hierarchical structure: video, video stories, and video shots. A video shot is usually represented by a representative frame, or key frame, which is selected from the frames presented in the shot. A video story is a video scene describing a complete semantic story, which is formed by a series of continuous video shots. In a video search task, a video shot is typically regarded as the basic search unit.

Given the above video structure, for a video search task we can represent the problem by a graphical model, which can be interpreted as a random walk problem from a probabilistic view [6]. Fig. 1 gives an example to illustrate the idea. Fig. 1(a) shows a set of video stories for retrieval; Fig. 1(b) describes the corresponding graph with respect to a given query topic. A "T" node represents the text content of a video story, while an "S" node represents a video shot. Note that links between "S" nodes are not plotted in the figure for simplicity. Hence, given a query topic formed by text (Q_T) and visual content (Q_V), the retrieval task can be regarded as the problem of finding the shots ("S" nodes in the figure) with maximal probabilities on the graph.

2.3. Learning Harmonic Ranking Functions

Based on the graph representation, given a retrieval task, a graph G can be constructed. Then, the video ranking task can be formulated as a learning problem of looking for a smooth function g over the graph. The value of the function g on each node is regarded as the relevance score of the node with respect to the query target. From a random walk viewpoint [6], considering a particle starting from a query node, the value of g on a searched node can be regarded as the probability that the particle from the starting query node hits the current node. Next, we show how to use the principles of Gaussian fields and harmonic functions to learn a harmonic ranking function over the constructed graphs [7, 8].

Fig. 1. Example of the graph representation for video retrieval: (a) a set of video stories; (b) the graph representation of the video structure. Note that links between "S" nodes in (b) are not plotted for simplicity.

Let us first consider a graph of a single modality. For a search task, assume there are l labeled examples {(x_i, y_i)}_{i=1}^{l} and u unlabeled examples {x_i}_{i=l+1}^{l+u} to be ranked. The label value y_i is either 1 for a positive example or 0 for a negative one. Let us construct a graph G = (V, E), where the vertex set V = L ∪ U, and L and U are the sets of labeled and unlabeled examples, respectively. We then construct a weight matrix W, which characterizes the data manifold structure. The weight w_ij between any two examples x_i, x_j ∈ R^d can be computed as

w_ij = exp( − Σ_{k=1}^{d} (x_ik − x_jk)^2 / σ_k^2 ),

where x_ik is the k-th component of the example x_i and σ_k is the length-scale parameter of each dimension. Now the ranking task is equivalent to the problem of assigning a real-valued label to each example in the unlabeled set U. Namely, the goal is to learn some real-valued function g : V → R on the graph G according to some proper criteria.
First,we constrain g to take values g(x i)=g l(x i)=y i on the labeled examples.Then,we look for a function g which is smooth with respect to the constructed graph.To this purpose, we try tofind the function g that minimizes the quadratic en-ergy function as follows:g=arg ming|L=g l12i,jw ij(g(x i)−g(x j))2(1)According to the graph theory,the minimum energy func-tion g enjoys the harmonic property,which means that the value of g at each unlabeled example is the average of g at the neighboring examples.In order to solve the harmonic func-tion g by matrix operations,we calculate the diagonal matrix D=diag(d i),where d i=jw ij and W is the weight ma-trix.Then we let P=D−1W and split the matrices W,D,and P into four blocks similar to the following structure:W=W ll W luW ul W uu(2)Let us denote g=g lg u,where g u consists of the valuesof function g on the unlabeled data,which is regarded as the final desirable ranking function.Consequently,the harmonic solution to thisfinal ranking function g u can be represented by the matrix operations as follows[7,8]:g u=(D uu−W uu)−1W ul g l=(I−P uu)−1P ul g l.(3)2.4.Multimodal Fusion Through GraphsNow let us discuss the issue of fusing multimodal resources to learn the ranking functions.The previous graph-based repre-sentation enables us to coherently fuse multimodal resources through graphs with proper probabilistic interpretation.Let us consider the example given in Fig.1.There are two modalities:text modality and visual modality.If we ignore the text modality,we can directly apply the above harmonic ranking solution to the visual modality.If the text informa-tion is included,each“S”node in the graph has two possible channels of hitting the starting query nodes.If we assume the hitting probability from the text channel is given byρ,then the probability of the other channel will be1−ρ.Here,ρis also regarded as a combination coefficient of multimodal fu-sion.Thus,we tackle the multimodal fusion problem byfind-ing the harmonic 
ranking function h_u on the enhanced graph, derived as follows:

h_u = \left( I - (1 - \rho) P_{uu} \right)^{-1} \left( (1 - \rho) P_{ul} \, g_l + \rho f_u \right). \qquad (4)

2.5. Multilevel Ranking Framework

We have outlined a semi-supervised ranking framework for learning harmonic ranking functions with multimodal resources through graphs. However, for a large-scale problem, directly applying the previous solution to the whole data set may not be computationally efficient. To attack this problem, we propose a multilevel ranking framework to achieve a good balance between retrieval performance and computational efficiency. The main idea is to combine multiple ranking strategies of different learning costs in multilevel learning stages. Fig. 2 shows the architecture of our proposed multimodal and multilevel ranking framework. In the first ranking stage, we employ a text-based approach to retrieve the top M ranked video stories associated with a collection of N1 video shots. In the second stage, combining with visual information, a nearest neighbor (NN) ranking strategy is engaged to retrieve the top N2 video shots among the N1 shots. In the third stage, a supervised large-margin learning method, the Support Vector Machine (SVM), is employed to rank the N2 shots and output the top N3 shots. Finally, we apply the semi-supervised ranking method to rerank the top N4 shots of the SVM results and return the top K shots to users. It is clear that N1 ≥ N2 ≥ N3 ≥ N4.

Fig. 2. Architecture of the multimodal and multilevel ranking framework.

3. EXPERIMENTAL RESULTS

3.1. Experimental Testbed

We perform experiments on the TRECVID 2005 testbed [9]. The dataset contains 277 broadcast news videos totalling 171 hours from 6 channels in 3 languages (English, Chinese, and Arabic). The automatic speech recognition (ASR) and machine translation (MT) transcripts are provided by NIST [2].
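Before moving to the experiments, the graph-ranking machinery of Eqs. (3) and (4) can be made concrete. Below is a minimal NumPy sketch (the function name and data layout are ours, not the paper's): it assumes the first l rows/columns of W index the labeled examples, and setting ρ = 0 recovers the single-modality harmonic solution of Eq. (3).

```python
import numpy as np

def harmonic_ranking(W, y_l, f_u=None, rho=0.0):
    """Harmonic ranking on a graph with weight matrix W.

    W:   (n, n) symmetric weight matrix; first l rows/cols are labeled.
    y_l: (l,) label values g_l (1 = positive, 0 = negative).
    f_u: optional text-channel scores on the unlabeled nodes; together
         with the combination coefficient rho this implements Eq. (4);
         rho = 0 reduces to the single-modality solution of Eq. (3).
    """
    l = len(y_l)
    P = W / W.sum(axis=1, keepdims=True)        # P = D^{-1} W (row-stochastic)
    P_uu, P_ul = P[l:, l:], P[l:, :l]
    n_u = P_uu.shape[0]
    if f_u is None:
        f_u = np.zeros(n_u)
    A = np.eye(n_u) - (1.0 - rho) * P_uu        # I - (1 - rho) P_uu
    b = (1.0 - rho) * (P_ul @ y_l) + rho * f_u
    return np.linalg.solve(A, b)                # h_u (equals g_u when rho = 0)
```

Note that the solution only requires one linear solve of size u, rather than an explicit matrix inverse; this is the standard numerically preferable formulation.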
We consider the automatic search task consisting of 24 query topics. Each query contains a text sentence and several image examples [9]. As performance evaluation metrics, we employ the non-interpolated average precision (AP) of a single query and the mean average precision (MAP) over multiple queries.

3.2. Textual Processing

Textual information comes from the ASR and MT transcripts. The ASR transcripts are all time-stamped at the word level, while the MT transcripts are time-stamped at the sentence level. The text transcripts are segmented at the video story level according to a story boundary detection method [10]. Shots within a video story share the same text block. All text stories and queries are parsed by a text parser with a standard list of stop words. The Okapi BM-25 formula is used as the retrieval model, together with pseudo-relevance feedback (PRF), for text search.

3.3. Visual Feature Representation

Three types of visual features are used: color, shape, and texture. For color, we use the Grid Color Moment. Each image is partitioned into 3×3 grids, and three types of color moments are extracted for each grid. Thus, an 81-dimensional color moment vector is adopted for color. For shape, we employ an edge direction histogram. A Canny edge detector is used to obtain the edge image, and then the edge direction histogram is computed. Each histogram is quantized into 36 bins of 10 degrees each. An additional bin is used to count the number of pixels without edge information.
Hence, a 37-dimensional edge direction histogram is used for shape. For texture, we adopt the Gabor feature. Each image is scaled to 64×64. A Gabor wavelet transformation is applied to the scaled image with 5 scale levels and 8 orientations, which results in 40 subimages. For each subimage, three moments are computed: mean, variance, and skewness. Thus, a 120-dimensional feature vector is adopted for texture. In total, a 238-dimensional feature vector is employed to represent each key frame of the video shots.

3.4. Performance Evaluation

We compare our multimodal and multilevel (MMML) ranking scheme with other traditional approaches. For comparison, the text-only approach is used as the baseline method. We also implemented two other ranking approaches: text search with visual reranking by Nearest Neighbor (Text+NN) and visual reranking by Support Vector Machines (Text+SVM). All ranking methods return the top 1,000 ranked shots for evaluation.

In our MMML ranking, we first perform text-based ranking to retrieve the top 20,000 shots. Secondly, visual reranking by NN is conducted to retrieve the top 2,000 shots. Thirdly, SVM is used to rerank them and return the top 1,000 shots. Finally, semi-supervised ranking is performed to rerank the top 100 shots of the SVM results.

Table 1 summarizes the comparison of MAP results. First, our implementation of the text-based method achieves a MAP of 0.0902, which is competitive with the best results reported at TRECVID 2005. For the visual reranking methods, we can see that the NN reranking method achieves an improvement of 15.96% over the baseline method. This shows that the set of visual features is quite effective. Further, we found that the reranking method by SVM achieves more significant results than the NN method. Finally, our MMML solution achieves the best MAP result of 0.1267, which improves on the baseline method by 40.47%.
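The four-stage pipeline above (text, then NN, then SVM, then semi-supervised reranking, with cutoffs 20,000 / 2,000 / 1,000 / 100) can be sketched as a generic score-based cascade. In this minimal sketch the scorer arrays are hypothetical stand-ins for the real rankers, and the function name is ours:

```python
import numpy as np

def mmml_cascade(text_scores, nn_scores, svm_scores, harmonic_scores,
                 n1=20000, n2=2000, n3=1000, n4=100):
    """Multilevel cascade: each stage reranks only the survivors of the
    previous, cheaper stage (n1 >= n2 >= n3 >= n4).

    Each *_scores argument is an array mapping shot id -> relevance
    score under one ranker (text / NN / SVM / harmonic).
    """
    order = np.argsort(-text_scores)[:n1]                # stage 1: text ranking
    order = order[np.argsort(-nn_scores[order])][:n2]    # stage 2: NN reranking
    order = order[np.argsort(-svm_scores[order])][:n3]   # stage 3: SVM reranking
    head = order[:n4]                                    # stage 4: semi-supervised
    head = head[np.argsort(-harmonic_scores[head])]      #          rerank of the head
    return np.concatenate([head, order[n4:]])            # final top-n3 ranked list
```

The design point is that the expensive graph-based solve of Eq. (4) only ever touches n4 = 100 shots, while the cheap text ranker filters the full collection.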
Fig. 3 also gives the specific evaluation results on all 24 queries of TRECVID 2005. These encouraging results show that our framework is effective and promising for content-based video retrieval tasks.

4. CONCLUSION

In this paper we propose a novel multimodal and multilevel ranking framework for content-based video retrieval. We address several challenges of content-based video retrieval and solve them effectively in our framework. Empirical results have shown that our method is effective and promising for future large-scale content-based video search engines.

Fig. 3. Evaluation on all 24 queries of TRECVID 2005.

Table 1. Comparison of MAP results of different ranking methods.

Methods        MAP     Improvement
Text Baseline  0.0902  0%
Text+NN        0.1046  +15.96%
Text+SVM       0.1171  +29.82%
MMML           0.1267  +40.47%

5. ACKNOWLEDGMENT

The work described in this paper was fully supported by a grant from the Shun Hing Institute of Advanced Engineering, The Chinese University of Hong Kong.

6. REFERENCES

[1] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-based image retrieval at the end of the early years," IEEE T. PAMI, vol. 22, no. 12, pp. 1349-1380, 2000.
[2] TRECVID, "TREC video retrieval evaluation," http://www-/projects/trecvid/.
[3] Rong Yan, Jun Yang, and Alexander G. Hauptmann, "Learning query-class dependent weights in automatic video retrieval," in Proc. ACM International Conference on Multimedia, New York, NY, USA, 2004, pp. 548-555.
[4] Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders, "Early versus late fusion in semantic video analysis," in Proc. ACM International Conference on Multimedia, Singapore, 2005, pp. 399-402.
[5] Winston H. Hsu, Lyndon S. Kennedy, and Shih-Fu Chang, "Video search reranking via information bottleneck principle," in Proc. ACM International Conference on Multimedia, Santa Barbara, CA, USA, 2006, pp. 35-44.
[6] P. Doyle and J. Snell, "Random walks and electric networks," Mathematical Assoc. of America, 1984.
[7] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. ICML,
2003.
[8] Steven C. H. Hoi and Michael R. Lyu, "A semi-supervised active learning framework for image retrieval," in Proc. IEEE CVPR, 2005.
[9] P. Over, W. Kraaij, and A. F. Smeaton, "TRECVID 2005: an overview," in Proc. TRECVID Workshop, 2005.
[10] Winston H. Hsu and Shih-Fu Chang, "Visual cue cluster construction via information bottleneck principle and kernel density estimation," in Proc. CIVR, Singapore, 2005.