Sparse Multi-Modal Hashing


Dictionary Training Methods for Sparse Representation and Their Applications

A Face Recognition Method Based on Sparse Representation
Yi Ma's method builds the dictionary directly from face images. In this work, the dictionary D is instead obtained by training the sample images with Discriminative K-SVD. A linear classifier with weights W and bias b is defined over the sparse codes, where each label vector h_i = [0, 0, …, 1, …, 0, 0]^T is a one-hot indicator of the class of sample i.
Discriminative K-SVD
Apply K-SVD dictionary training to it directly
Initial dictionary D
Sparse coding
Dictionary update
K-SVD
(1) Initial dictionary D: an overcomplete DCT dictionary.
(2) Sparse coding:
$$\min_{X} \|Y - DX\|_F^2 \quad \text{s.t.}\ \forall j,\ \|x_j\|_0 \le L$$
For the $j$-th column of $Y$ this reduces to
$$\min_{x} \|y_j - Dx\|_2^2 \quad \text{s.t.}\ \|x\|_0 \le L$$
(3) Dictionary update (K-SVD): isolate the contribution of the $j$-th atom,
$$\|Y - DX\|_F^2 = \Big\|Y - \sum_{j=1}^{K} d_j x_T^j\Big\|_F^2 = \Big\|\Big(Y - \sum_{i \ne j} d_i x_T^i\Big) - d_j x_T^j\Big\|_F^2 = \|E_j - d_j x_T^j\|_F^2,$$
and update $d_j$ together with $x_T^j$ from the best rank-1 approximation (SVD) of $E_j$, restricted to the samples that actually use atom $d_j$.
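The sparse-coding and dictionary-update steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes scikit-learn's orthogonal matching pursuit for the sparse-coding step, a random initial dictionary instead of the overcomplete DCT dictionary, and made-up sizes.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd(Y, n_atoms=64, L=5, n_iter=10):
    """Minimal K-SVD sketch: Y is (n_features, n_samples)."""
    rng = np.random.default_rng(0)
    D = Y[:, rng.choice(Y.shape[1], n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True)            # unit-norm atoms
    for _ in range(n_iter):
        # Sparse coding: min ||y_j - D x||_2 s.t. ||x||_0 <= L, for every column
        X = orthogonal_mp(D, Y, n_nonzero_coefs=L)
        # Dictionary update: rank-1 SVD of the residual restricted to users of atom j
        for j in range(n_atoms):
            users = np.nonzero(X[j, :])[0]
            if users.size == 0:
                continue
            E_j = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
            U, s, Vt = np.linalg.svd(E_j, full_matrices=False)
            D[:, j] = U[:, 0]
            X[j, users] = s[0] * Vt[0, :]
    return D, X
```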
Single-Channel Image Separation Based on Dictionary Training
Method 1: improve the K-SVD algorithm so that, when the dictionary is updated by SVD, atoms with low coherence to the other signals are selected. Method 2: improve the GAD algorithm so that the dictionary is built from atoms of the image (or of the residual image) that have low coherence with the other images.
$$\mu(\Phi, \Psi) = \sqrt{n}\, \max_{1 \le k,\, j \le n} \big|\langle \varphi_k, \psi_j \rangle\big|$$
According to the theorem in the literature, incoherence between two matrices means that the atoms of one matrix cannot sparsely represent the atoms of the other (and vice versa). E. J. Candès and M. B. Wakin, "An Introduction to Compressive Sampling," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21-30, 2008.
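The coherence measure above can be computed directly. The helper below assumes two bases are given as square matrices whose columns are unit-norm basis vectors; the function and variable names are illustrative.

```python
import numpy as np

def mutual_coherence(Phi, Psi):
    """mu(Phi, Psi) = sqrt(n) * max |<phi_k, psi_j>| over all column pairs."""
    n = Phi.shape[0]
    return np.sqrt(n) * np.max(np.abs(Phi.T @ Psi))
```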

Deep Joint-Semantic Hashing for Cross-Modal Retrieval

Journal of Chinese Computer Systems, Vol. 43, No. 3, March 2022. DOI: 10.20009/j.cnki.21-1106/TP.2020-0924. CLC number: TP183; Document code: A; Article ID: 1000-1220(2022)03-0589-09.

Deep Joint-Semantic Hashing for Cross-Modal Retrieval
XIONG Wei, WANG Zhan-qing, WANG Xiao-yu (School of Science, Wuhan University of Technology, Wuhan 430070, China). E-mail: wangzhq@whut.edu.cn
Received 2020-12-16; revised 2020-12-26. Supported by the Hubei Province Key Fund (2015CFA059) and the Fundamental Research Funds for the Central Universities (2019ZY232).

Abstract: Thanks to its low storage consumption and fast query speed, hash retrieval has attracted considerable attention in cross-modal retrieval research, and deep-learning-based cross-modal hashing is a research hotspot. Most deep hashing methods ignore the latent relevance and semantic discriminability of the data content when learning feature correlations across multimodal data, which weakens the relevance of the hash codes and easily causes incompatibility between the original data features and the neural-network features. To address these problems, this paper proposes an image-text Deep Joint-Semantic Hashing method (DJSH). DJSH uses two neural networks to extract fine-grained features of images and texts respectively, and designs a hash layer and a label layer for each modality network, used for feature learning and label prediction. On the one hand, the feature-learning module performs deep cross-modal feature learning with a joint-semantic feature loss that strengthens content similarity and balances the data distribution, while a graph neighborhood structure based on a Laplacian constraint preserves the similarity ranking of the original data. On the other hand, label prediction and label alignment embed discriminative label information into the cross-modal networks, ensuring that every bit of the hash code carries class-discriminative information.
The experimental results on the MIRFLICKR25K, NUS-WIDE and IAPR-TC12 benchmark datasets demonstrate that DJSH outperforms recent state-of-the-art cross-modal retrieval models.
Key words: cross-modal retrieval; hashing learning; deep neural network; fine-grained feature; label alignment; joint-semantic

1 Introduction
With the rapid development of Internet information and technology, multimedia data is growing explosively, and cross-modal retrieval has become a research hotspot in artificial intelligence. For example, given a query image one may need to retrieve the texts that best describe it, or match a given text against a set of visually related images. Cross-modal retrieval efficiently analyses the semantic correlation of multimodal data and matches instances across modalities. In computer-vision applications such as information retrieval, image classification and object detection, nearest neighbor (NN) search is a widely used technique that finds, under a given distance measure, the database items closest to a query. For large-scale or complex data, computing the distance between the query and every database item is expensive, so approximate nearest neighbor (ANN) search has become the most common retrieval scheme in cross-modal tasks. Because hash representations have small storage cost, fast retrieval speed and low communication overhead, they have received wide attention in large-scale information retrieval.

The key problem of cross-modal retrieval is how to learn the intrinsic correlation between data of different modalities. Early hashing-based cross-modal methods usually project hand-crafted features (SIFT, GIST, etc.) into the Hamming space and preserve the feature correlation while learning the hash codes; however, they treat feature extraction and hash learning as two independent processes and cannot perform both within one framework. With the rapid development of deep learning, many deep hashing methods have been proposed, but most of them adopt the traditional label-based similarity, which simply divides pairs into similar and dissimilar. Some deep methods improve the similarity measure (e.g., cosine similarity and the Jaccard coefficient) and obtain better retrieval performance. Since data from different modalities have their own specific representation forms, further mining the latent correlation of the data content can improve cross-modal retrieval. To exploit the deep feature information and spatial structure of multimodal data, this paper proposes a new deep joint-semantic model (DJSH), shown in Fig. 1. The main contributions are:
1) An end-to-end deep joint-semantic framework that fully exploits the deep feature correlation of cross-modal data and the neighborhood relation of the original data.
2) A feature-learning loss that balances the data distribution; it not only judges whether data are similar but also preserves the similarity of the data content.
3) A graph neighborhood structure based on a Laplacian constraint, so that the learned hash codes preserve both the neighborhood relation and the similarity ranking of the original data.
4) Label prediction and label alignment techniques that give the generated hash codes discriminative information for different classes.

2 Related work
Depending on whether prior knowledge such as labels is used during training, cross-modal hashing methods can roughly be divided into unsupervised and supervised methods.
2.1 Unsupervised cross-modal hashing
Unsupervised methods mine intra- and inter-modality correlations from unlabeled multimodal data and learn a mapping from the original data to a common subspace. Inter-Media Hashing (IMH) learns hash functions with a linear regression model and maps unlabeled data from heterogeneous sources into a common feature subspace. Unsupervised Deep Cross-Modal Hashing (UDCMH) builds a binary latent factor model with deep networks and matrix factorization and introduces a Laplacian constraint in hash learning. Dictionary Learning Cross-Modal Hashing (DLCMH) generates sparse representations of the data by dictionary learning and projects them into a latent common subspace for hash learning. Deep Joint-Semantics Reconstructing Hashing (DJSRH) computes the cosine similarity of the original features and builds a joint-semantic affinity matrix, capturing well the latent semantic correlation of unlabeled instances. Cycle-consistent deep generative hashing (CYC-DGH) designs a generative network and a discriminative network: the generative network generates data of one modality from the other according to the data distribution, the discriminative network judges real from generated data, and the two improve each other through adversarial training.
2.2 Supervised cross-modal hashing
Supervised methods exploit supervision such as training labels or their semantic correlation to mine the relations among multimodal data. Cross-View Hashing (CVH) extends single-view hashing, learning a hash function for each modality while preserving intra- and inter-view correlations during training. Discriminant Cross-modal Hashing (DCH) builds a linear classifier with binary constraints and minimizes the linear mapping error between hash codes and labels. Semantic Correlation Maximization (SCM) maximizes the semantic correlation of multimodal data through a pairwise similarity matrix. Semantics-Preserving Hashing (SePH) builds an affinity matrix from the probability distribution of the original data and minimizes the KL divergence for hash learning. Deep Cross-Modal Hashing (DCMH) adopts the end-to-end idea and first proposes a deep framework in which feature extraction and hash learning proceed in parallel. Self-Supervised Adversarial Hashing (SSAH) trains a label-semantic network with label information that supervises the other networks during hash learning. Triplet-based Deep Hashing (TDH) introduces a triplet similarity loss that preserves pairwise similarity and also captures the differences between instances. Attention-aware Deep Adversarial Hashing (ADAH) introduces an attention mechanism to distinguish attended (foreground) from unattended (background) regions and extract the salient features of the data.

3 Problem formulation
This paper only considers cross-modal retrieval between image and text data. The training set contains paired images, texts and labels, O = {(x_i, y_i, l_i)}_{i=1}^n, where n is the number of sample pairs; X = {x_i}_{i=1}^n and Y = {y_i}_{i=1}^n are the image and text data and L = {l_i}_{i=1}^n the corresponding labels. S is the similarity matrix between samples: S_ij = 1 if sample i and sample j share at least one class label, and S_ij = 0 otherwise. The image and text hash features are learned in their respective networks:

F_i* = f(x_i; θ_x),  G_j* = g(y_j; θ_y)    (1)

where θ_x and θ_y are the parameters of the image and text networks, and f(x_i; θ_x) and g(y_j; θ_y) are the outputs of the hash layers. The main task of the cross-modal retrieval model is to learn high-quality hash functions f and g such that the image hash feature F_i* and the text hash feature G_j* are as consistent as possible when S_ij = 1, and as dissimilar as possible when S_ij = 0.

4 The deep joint-semantic model
4.1 Framework
The proposed framework (Fig. 1) consists of two parts. The feature-learning module learns strongly correlated deep image and text features through the image and text networks, and builds adjacency matrices for the image and text modalities to preserve the similarity ranking of the original data features. The label prediction and alignment module generates predicted labels with the same dimensionality as the ground-truth labels and aligns the semantically rich label matrix with the hash-code matrix, improving the discriminability of the hash codes generated for instances of different classes.
Fig. 1  Framework of the deep joint-semantic hashing algorithm (DJSH)

For the image modality, the feature-learning network has 8 layers: 5 convolutional layers (conv1-conv5) and 3 fully connected layers (fc6-fc8). The first 7 layers are identical to CNN-F and use ReLU activations; after the 7th layer a fully connected layer with r + c hidden nodes is added, containing a hash layer and a label layer. The hash layer has r nodes and uses tanh to generate the r-bit hash feature; the label layer has c nodes and uses sigmoid to generate the c-class predicted label. The configuration is shown in Table 1.

Table 1  Configuration of the image modality network
  conv1: kernel 64 × 11 × 11; stride 4 × 4, pad 0, LRN, ×2 pooling
  conv2: kernel 256 × 5 × 5; stride 1 × 1, pad 2, LRN, ×2 pooling
  conv3: kernel 256 × 3 × 3; stride 1 × 1, pad 1
  conv4: kernel 256 × 3 × 3; stride 1 × 1, pad 1
  conv5: kernel 256 × 3 × 3; stride 1 × 1, pad 1, max pooling
  fc6:   4096, Dropout
  fc7:   512, Dropout
  hash/label layer: r + c, Dropout

For the text modality, a deep feed-forward network of 3 fully connected layers is used. Its input is the bag-of-words (BoW) representation of the text, and after the 3 fully connected layers it outputs the deep text feature and the predicted label. The length of fc1 equals the length of the word vector, fc2 has 512 hidden nodes, and fc3 is a fully connected layer with r + c hidden nodes. The first two layers (fc1, fc2) use ReLU; in the last layer (fc3) the hash layer and label layer use tanh and sigmoid respectively to generate the hash feature and the predicted label. The configuration is shown in Table 2.

Table 2  Configuration of the text modality network
  fc1: 4096, Dropout
  fc2: 512, Dropout
  hash/label layer: r + c, Dropout

4.2 Feature-learning module
4.2.1 Hash feature learning
Deep cross-modal hashing usually builds a similarity measure between data from label information and preserves the feature correlation in the high-level space while learning the hash functions of the different modalities. The label semantic similarity only describes a pair as similar or dissimilar; its likelihood function is defined as

p(S_ij | F_i*, G_j*) = σ(Θ_ij)^{S_ij} (1 − σ(Θ_ij))^{1−S_ij}    (2)

where Θ_ij = ½ (F_i*)(G_j*)^T and σ(x) = 1/(1 + e^{−x}). To keep the learned deep image and text features semantically consistent across modalities, the feature-learning loss can be defined as the negative log-likelihood of the cross-modal similarity:

J_1 = Σ_{i,j} ( log(1 + e^{Θ_ij}) − S_ij Θ_ij )    (3)

For samples sharing class labels, the label semantic similarity cannot distinguish degrees of similarity, so when the proportion of similar pairs in the training set is high, this loss alone cannot handle the feature-matching task effectively. Inspired by [16], the Jaccard coefficient reflects the content similarity of sample pairs:

K_ij = N(l_i ∧ l_j) / ( N(l_i) + N(l_j) − N(l_i ∧ l_j) )    (4)

where l_i (l_j) is the i-th (j-th) row of the label matrix, l_i ∈ {0,1}^c, N(l_i) is the number of 1s in l_i, and N(l_i ∧ l_j) is the number of positions where l_i and l_j are both 1. A larger K_ij means the contents of the pair are more similar. Let D_ij denote the Hamming distance between the image hash feature F_i* and the text hash feature G_j*, and r the hash-code length; the hash-code similarity can then be written as 1 − D_ij/r, so the content similarity loss is

J_2 = Σ_{i,j} ( 1 − D_ij/r − K_ij )² = Σ_{i,j} ( 1/2 + Θ_ij/r − K_ij )²    (5)

with Θ_ij = ½ (F_i*)(G_j*)^T. Combining (3) and (5), the inter-modality feature correlation loss is

J_inter = Σ_{i,j} [ w_ij ( log(1 + e^{Θ_ij}) − S_ij Θ_ij ) + w'_ij α ( 1/2 + Θ_ij/r − K_ij )² ]    (6)

where α is a parameter and w_ij, w'_ij are balance coefficients that compensate for the unbalanced distribution of training pairs:

w_ij = |S⁻| / |S_batch|,   w'_ij = |S⁺| / |S_batch|    (7)

The model is trained on mini-batches; |S_batch| is the number of pairs in the current batch, |S⁻| the number of dissimilar pairs and |S⁺| the number of similar pairs. When similar pairs dominate the batch, the weight w'_ij for learning content similarity increases; when dissimilar pairs dominate, J_inter leans toward discriminating whether samples are similar. J_inter only considers pairs across modalities, which may make it hard to measure the correlation of samples within the same modality. To keep the generated hash features consistent for samples of the same modality and strengthen intra-modality pairwise similarity, the intra-modality pairwise loss is

J_intra = Σ_{i,j} [ w_ij ( log(1 + e^{Θ^m_ij}) − S_ij Θ^m_ij ) + w'_ij α ( 1/2 + Θ^m_ij/r − K_ij )² ]    (8)

where r is the code length, Θ^m_ij = ½ (F_i*)(F_j*)^T for the image modality and Θ^m_ij = ½ (G_i*)(G_j*)^T for the text modality.

4.2.2 Similarity ranking learning
Supervised hashing mostly measures the correlation of two instances with multi-label semantic similarity, but since each modality has its own specific representation, the correlation of cross-modal data may not exist only in that abstract form. To mine the neighborhood structure of the multimodal data, let I and T denote the neighbor matrices of the original images and texts, with elements

I_ij = ⟨u_i, u_j⟩ / (‖u_i‖₂ ‖u_j‖₂),   T_ij = ⟨v_i, v_j⟩ / (‖v_i‖₂ ‖v_j‖₂)    (9)

where u_i and u_j are the SIFT features of the i-th and j-th images, and v_i and v_j the bag-of-words features of the i-th and j-th texts. To overcome the incompatibility between neural-network features and the original data features, Laplacian constraints Σ_ij I_ij ‖F_i* − F_j*‖² and Σ_ij T_ij ‖G_i* − G_j*‖² are built for the image and text modalities, so that the generated hash codes preserve the similarity ranking of the original data. Taking the image modality as an example, if I_12 > I_13, then during training F_1* is kept closer to F_2* than to F_3*; the Laplacian constraint therefore preserves both the neighborhood structure and the similarity ranking of the original data. Optimizing the Laplacian constraint directly is a discrete problem requiring pairwise feature distances over each batch, so it is rewritten as

Σ_ij I_ij ‖F_i* − F_j*‖² = Tr(Fᵀ L_I F),  Σ_ij T_ij ‖G_i* − G_j*‖² = Tr(Gᵀ L_T G)    (10)

where L_I = diag(I·1) − I and L_T = diag(T·1) − T. The similarity ranking loss is then

J_rank = β ( Tr(Fᵀ L_I F) + Tr(Gᵀ L_T G) )    (11)

where β is a parameter.
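To make the combined likelihood and content-similarity objective of Eqs. (3)-(7) concrete, here is a minimal PyTorch-style sketch. It is not the authors' code: the tensor names, the use of softplus for log(1 + e^x), and computing the balance weights from the whole batch are assumptions made for illustration.

```python
import torch

def inter_modal_loss(F_img, G_txt, S, K, alpha=1.0):
    """Sketch of J_inter (Eqs. 3-7).

    F_img, G_txt : (n, r) real-valued hash features (tanh outputs)
    S            : (n, n) 0/1 semantic similarity matrix
    K            : (n, n) Jaccard coefficients from the label matrix
    """
    r = F_img.size(1)
    theta = 0.5 * F_img @ G_txt.t()                      # theta_ij = 1/2 <F_i, G_j>
    nll = torch.nn.functional.softplus(theta) - S * theta  # log(1+e^theta) - S*theta
    content = (0.5 + theta / r - K) ** 2                 # (1 - D_ij/r - K_ij)^2
    n_pos = S.sum()
    n_all = S.numel()
    w = (n_all - n_pos) / n_all                          # weight of the likelihood term
    w_prime = n_pos / n_all                              # weight of the content term
    return (w * nll + w_prime * alpha * content).mean()
```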
4.2.3 Joint-semantic feature loss
The joint-semantic feature loss mines the content correlation of image and text data through the content-based similarity measure and introduces the Laplacian-constrained graph neighborhood structure to preserve the similarity ranking of the original features. It consists of the inter-modality loss, the intra-modality loss and the ranking loss of the original data:

J_joint = J_inter + J_intra + J_rank    (12)

4.3 Label prediction and alignment module
Inspired by SSAH, DJSH also trains a label layer for each modality network, generating predicted labels with the same dimensionality as the ground-truth labels. The predicted labels of the cross-modal data should stay as consistent as possible with the ground truth, so the label prediction loss is defined as

J_4 = ‖L̂_x − L‖²_F + ‖L̂_y − L‖²_F    (13)

Most deep hashing methods only consider the similarity between instances when learning hash codes and therefore cannot guarantee highly discriminative codes. To make the hash codes of instances from different classes better separated, DJSH, inspired by auto-encoding methods, introduces label alignment, embedding the discriminative information of the labels into the hash codes. Concretely, DJSH additionally learns a stable linear mapping P from the label matrix L to the hash matrix B such that LP = B; the label alignment loss is

J_5 = ‖B − LP‖²_F    (14)

By learning the mapping from the label matrix to the hash codes, label alignment ensures that every bit of the hash code carries rich class information.

4.4 Objective function
To further improve the model, the image and text data are encouraged to learn the same hash codes during training, giving the quantization loss

J_6 = ‖B − F‖²_F + ‖B − G‖²_F    (15)

Combining the feature-learning module with the label prediction and alignment module, the overall objective is

J = J_joint + λ J_4 + μ J_5 + η J_6    (16)

where η, λ and μ are balance parameters.

5 Optimization
The objective (16) is non-convex in the two matrix variables P, B and the two network parameters θ_x, θ_y, so an alternating iteration strategy is used to update them.
5.1 Learning θ_x. With P, B and θ_y fixed, θ_x is optimized by stochastic gradient descent (SGD) with back-propagation; the gradients of the objective with respect to F_i* and the predicted image labels, given in (17) and (18), are propagated through the network by the chain rule.
5.2 Learning θ_y. With P, B and θ_x fixed, θ_y is optimized in the same way using the gradients (19) and (20) with respect to G_i* and the predicted text labels.
5.3 Learning P. With B, θ_x and θ_y fixed, (16) reduces to

min_P J = μ ( ‖B − LP‖²_F + ‖P‖²_F ),  P ∈ R^{c×r}    (21)

Writing (21) with traces and setting the derivative with respect to P to zero gives the closed-form solution

P = μ ( Lᵀ L + I )⁻¹ Lᵀ B    (24)

5.4 Learning B. With P, θ_x and θ_y fixed, (16) reduces to

min_B J = η ( ‖B − F‖²_F + ‖B − G‖²_F ) + μ ‖B − LP‖²_F,  B ∈ {−1, +1}^{n×r}    (25)

which can be turned into the trace problem

max_B Tr(Bᵀ H),  H = η F + η G + μ L P    (26)

so the hash-code matrix is updated by

B = sign( η F + η G + μ L P )    (27)

5.5 Out-of-sample extension
During retrieval, for an image x_query that is not in the training set, the image modality network generates its hash code as

b^x_query = sign( f(x_query; θ_x) )    (28)

and similarly, for a text query y_query, the text modality network generates

b^y_query = sign( g(y_query; θ_y) )    (29)

The overall procedure is summarized in Algorithm 1.
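A minimal sketch of the two closed-form updates just described (Eqs. (24) and (27)) is given below. It is illustrative only and assumes the reconstructed forms above; the variable names and the use of torch.linalg.solve are choices made for the example, not the paper's implementation.

```python
import torch

def update_P_and_B(F, G, L, eta=1.0, mu=1.0):
    """Alternating closed-form updates: L is the (n, c) label matrix,
    F and G the (n, r) image/text hash features, B the binary codes."""
    c = L.size(1)
    # Eq. (24): P = mu * (L^T L + I)^{-1} L^T B, here computed against eta*F + eta*G
    B0 = torch.sign(eta * F + eta * G)                    # current code estimate
    P = mu * torch.linalg.solve(L.t() @ L + torch.eye(c), L.t() @ B0)
    # Eq. (27): discrete code update
    B = torch.sign(eta * F + eta * G + mu * L @ P)
    return P, B
```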
Algorithm 1. Deep joint-semantic cross-modal hashing
Input: image set X, text set Y, label matrix L.
Output: network parameters θ_x and θ_y, linear projection matrix P and hash-code matrix B.
1. Initialize the parameters α, β, η, λ, μ and the matrices P and B; set the image and text batch sizes n_x and n_y, the maximum number of iterations T_max, and the numbers of image- and text-network iterations T_x and T_y.
2. for i = 1 to T_x do
3.   Randomly sample n_x images from the training set and compute the gradients (17) and (18).
4.   Update the image network parameters θ_x by the chain rule and back-propagation.
5. end for
6. for i = 1 to T_y do
7.   Randomly sample n_y texts from the training set and compute the gradients (19) and (20).
8.   Update the text network parameters θ_y by the chain rule and back-propagation.
9. end for
10. Update P by (24);
11. Update B by (27);
12. Repeat steps 2-11 until the objective reaches the convergence threshold or the maximum number of iterations T_max.

6 Experiments
Experiments are carried out on the MIRFLICKR25K, NUS-WIDE and IAPR-TC12 benchmark datasets, and the retrieval performance is compared with state-of-the-art cross-modal hashing methods.
6.1 Datasets
MIRFLICKR25K contains 25 015 images collected from the Flickr website. Only instances with at least 20 text tags are kept, giving 20 015 image-text pairs; each text is described by a 1386-dimensional bag-of-words vector, and each pair is annotated with one or more of 24 semantic labels.
NUS-WIDE contains 195 834 web images with associated text tags. Each pair carries 21 concept labels; the text is represented as a 1000-dimensional word vector and the hand-crafted image feature is a 500-dimensional bag-of-visual-words (BoVW) vector.
IAPR-TC12 consists of 20 000 images from a wide range of domains, such as sports and actions, people, animals, cities and landscapes. Each image has at least one sentence annotation and each pair is annotated with 275 labels. For evaluation, the 18 715 images of the 12 most frequent concept labels are used, producing 33 447 image-sentence pairs.
6.2 Implementation details and evaluation metrics
The experimental environment is Ubuntu 18.04 with an E5-2670 CPU, 64 GB of memory and a 1080Ti (11 GB) GPU. The parameters are set to α = β = η = λ = μ = 1, the learning rate of the image network lies in [10⁻⁹, 10⁻⁵·⁵] and the text network uses a similarly small range. From each dataset 12 000 instances are randomly sampled, of which 10 000 are used for training and 2000 for testing. Performance is evaluated with mean average precision (mAP) and precision-recall curves. All experiments run under the PyTorch framework and results are averaged over 3 runs.
mAP, a common metric in information retrieval, is the mean of the average precision (AP) over the queries and reflects the average level of retrieval precision:

mAP = (1/|M|) Σ_{q_i ∈ M} AP(q_i)    (30)

AP(q_i) = (1/N) Σ_{r=1}^{R} P(r) δ(r)    (31)

where M is the query set, N is the number of instances in the retrieval set relevant to q_i, R is the total amount of data, P(r) is the precision of the top r retrieved instances, and δ(r) = 1 if the r-th retrieved instance is relevant to the query and 0 otherwise. Precision and recall are defined as

P = TP / (TP + FP),  R = TP / (TP + FN)    (32)

where TP is the number of retrieved relevant items, FP the number of retrieved irrelevant items, and FN the number of relevant items that were not retrieved.
6.3 Retrieval performance
DJSH is compared with 7 advanced cross-modal hashing algorithms: CVH, STMH, SCM, SePH, DCMH, SSAH and ADAH. CVH, STMH, SCM and SePH use hand-crafted features, while the other algorithms extract features with deep neural networks. Table 3 reports the mAP of the different methods on the three datasets for the image-to-text (I→T) and text-to-image (T→I) tasks; deep hashing methods clearly perform better than non-deep ones.

Table 3  mAP comparison of different methods
Task  Method   MIRFLICKR25K           NUS-WIDE               IAPR-TC12
               16bit  32bit  64bit    16bit  32bit  64bit    16bit  32bit  64bit
I→T   CVH      0.557  0.554  0.554    0.374  0.366  0.361    0.342  0.336  0.330
      STMH     0.613  0.621  0.627    0.471  0.486  0.494    0.377  0.400  0.413
      SCM      0.671  0.682  0.685    0.540  0.548  0.555    0.369  0.366  0.380
      SePH     0.712  0.719  0.723    0.603  0.613  0.621    0.444  0.456  0.463
      DCMH     0.741  0.746  0.749    0.590  0.603  0.609    0.453  0.473  0.484
      ADAH     0.756  0.771  0.772    0.640  0.629  0.652    0.529  0.528  0.544
      SSAH     0.782  0.790  0.800    0.642  0.636  0.639    0.500  0.533  0.553
      DJSH     0.830  0.834  0.851    0.775  0.782  0.789    0.598  0.611  0.631
T→I   CVH      0.574  0.571  0.571    0.361  0.349  0.339    0.349  0.343  0.337
      STMH     0.607  0.615  0.621    0.447  0.467  0.478    0.368  0.389  0.404
      SCM      0.693  0.701  0.706    0.534  0.541  0.548    0.345  0.341  0.347
      SePH     0.721  0.726  0.731    0.598  0.602  0.610    0.442  0.456  0.464
      DCMH     0.782  0.790  0.793    0.638  0.651  0.657    0.518  0.537  0.546
      ADAH     0.792  0.806  0.807    0.678  0.697  0.703    0.535  0.556  0.564
      SSAH     0.791  0.795  0.803    0.669  0.662  0.666    0.516  0.551  0.570
      DJSH     0.832  0.837  0.855    0.760  0.767  0.771    0.609  0.636  0.656

On MIRFLICKR25K the mAP of DJSH improves clearly over the other methods: compared with the non-deep frameworks it is 15%-25% higher, and compared with the deep frameworks (DCMH, SSAH and ADAH) it is still 4%-10% higher; in particular, at a code length of 64 bits it reaches 0.851 (I→T) and 0.855 (T→I). On NUS-WIDE and IAPR-TC12, DJSH is on average about 0.1 higher than the other deep frameworks. Because DJSH incorporates richer data-content information than the other algorithms, and the Laplacian-constrained feature similarity ranking overcomes the incompatibility between the original features and the neural-network features, the learned hash codes are more semantically discriminative and better suited to mutual retrieval of multimodal data.
Fig. 2 shows the precision-recall curves of all compared methods at a code length of 16 on the different datasets. From Fig. 2(a)-(c), DJSH has a clear advantage on the image-to-text task, with precision generally higher than the other methods at different recall levels; Fig. 2(d)-(f) likewise shows higher retrieval performance on the text-to-image task. The joint-semantic feature loss of DJSH considers both the inter-modality and intra-modality feature correlation, preserves the content similarity of the multimodal data and balances the distribution of similar and dissimilar pairs, so the precision-recall curves of DJSH are smoother.
6.4 Parameter sensitivity
Fig. 3 shows the sensitivity of the 5 parameters on the MIRFLICKR25K dataset. With the hash-code length fixed to 16, each parameter is varied over {0.01, 0.1, 1, 10, 100} while the others are fixed to 1, and its influence on mAP is studied. The mAP of DJSH is fairly sensitive to α, β, η and λ, and α and λ reach the maximum mAP around 1.
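As a closing illustration of the mAP metric defined in Eqs. (30)-(31), the small function below computes the average precision of a single query from a ranked retrieval list. The function name and array-based interface are assumptions, not part of the paper.

```python
import numpy as np

def average_precision(relevant, ranked_indices, top_n=None):
    """AP for one query: `relevant` is a boolean array over the database,
    `ranked_indices` is the retrieval order (e.g., by Hamming distance)."""
    gains = relevant[ranked_indices[:top_n]]
    if gains.sum() == 0:
        return 0.0
    precisions = np.cumsum(gains) / (np.arange(len(gains)) + 1)
    return float((precisions * gains).sum() / gains.sum())
```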

NVM-LH: a Non-Volatile-Memory-Friendly Linear Hash Index

Journal of Computer Applications, 2021, 41(3): 623-629. Published 2021-03-10. ISSN 1001-9081.
TANG Chen, HUANG Guorui, JIN Peiquan (1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230001, China; 2. Unit 31002, Chinese People's Liberation Army, Beijing 100081, China). Corresponding author e-mail: jpq@

Abstract: Non-volatile memory (NVM) has attracted attention for its large capacity, persistence, bit addressability and low read latency, but it also has drawbacks such as a limited number of writes and asymmetric read and write speeds.

To address the problem that a traditional linear hash index implemented directly on NVM causes a large number of random write operations, a new NVM-friendly linear hash index, NVM-LH, is proposed.

NVM-LH achieves cache friendliness by aligning data to cache lines when storing it, and provides a log-free strategy for guaranteeing data consistency.

In addition, NVM-LH reduces NVM writes by optimizing the split and delete operations.
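For readers unfamiliar with linear hashing, the sketch below shows the textbook split operation that motivates this optimization: each split rehashes and rewrites the records of one bucket, which on NVM turns into many small random writes. This is generic linear hashing, not NVM-LH's implementation; class and parameter names are made up.

```python
class LinearHash:
    """Textbook linear hashing (not NVM-LH)."""

    def __init__(self, initial_buckets=4, max_load=0.75):
        self.n0 = initial_buckets        # buckets at the start of the current round
        self.level = 0                   # doubling round
        self.next_split = 0              # next bucket to split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.max_load = max_load
        self.count = 0

    def _index(self, key):
        h = hash(key)
        i = h % (self.n0 * (2 ** self.level))
        if i < self.next_split:          # bucket already split in this round
            i = h % (self.n0 * (2 ** (self.level + 1)))
        return i

    def insert(self, key, value):
        self.buckets[self._index(key)].append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split()

    def _split(self):
        # Splitting rewrites every record of one bucket -- the random-write
        # pattern that an NVM-friendly design tries to reduce.
        old = self.buckets[self.next_split]
        self.buckets.append([])
        self.next_split += 1
        if self.next_split == self.n0 * (2 ** self.level):
            self.level += 1
            self.next_split = 0
        items, old[:] = old[:], []
        for k, v in items:
            self.buckets[self._index(k)].append((k, v))
```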

Experimental results show that NVM-LH outperforms CCEH, with 30% higher space utilization and about 15% fewer NVM writes, demonstrating better NVM friendliness.

Key words: non-volatile memory (NVM); dynamic hashing; linear hashing; cache line friendliness; data consistency

0 Introduction
Over the past decades, limited storage density has kept the capacity of dynamic random access memory (DRAM) below 64 GB, which cannot satisfy the demand of big-data applications for large main memory.

Deep Learning Notes: Sparse Representations in Neural Networks

A neural network is a powerful machine-learning tool that builds a model through the connections between a series of neurons and weights.

Neural networks have already shown strong capabilities in many fields.

However, they also have some problems, one of which is how to handle sparsely represented data.

In this article we look at what a sparse representation is and how neural networks can handle this kind of data.

What is a sparse representation? A sparse representation is one in which most elements of the data are zero or close to zero, and only a few elements take non-zero values.

This situation is very common in practice; for example, the speech signal in speech recognition is a kind of sparse representation.

How is sparsely represented data handled? Modern neural networks usually use fully connected layers, in which every element of the input is connected to every neuron.

This approach is very effective for densely represented data, but it can cause problems when the data is sparse.

For example, in image data every pixel can be regarded as an input element.

In most images, however, the pixel values are very small, similar to sparsely represented data.

Using a fully connected network for image classification then leads to a very large model, long processing times, and a strong tendency to overfit.

Algorithms for sparsely represented data therefore usually need dedicated techniques.

One solution is sparse coding, a technique designed for handling sparsely represented data.

Sparse coding is an unsupervised learning method that combines the data to produce a small coding vector.

Because the coding vector is very small, this approach lets the network process sparsely represented data more efficiently.

For example, if sparse coding reduces the input from 1000 dimensions to 100, the fully connected layers of the network become much smaller and processing becomes faster.
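As a hypothetical illustration of this 1000-to-100 reduction, the sketch below encodes toy 1000-dimensional samples as 100-dimensional sparse codes with scikit-learn; the random dictionary, the OMP solver and the sparsity level are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)
X = rng.random((500, 1000))                       # 500 samples, 1000 raw features (toy data)

# A fixed random dictionary with 100 atoms; in practice it would be learned.
D = rng.standard_normal((100, 1000))
D /= np.linalg.norm(D, axis=1, keepdims=True)

coder = SparseCoder(dictionary=D, transform_algorithm="omp", transform_n_nonzero_coefs=10)
codes = coder.transform(X)                        # (500, 100) sparse codes
print(codes.shape, np.count_nonzero(codes, axis=1).mean())
# The 100-dimensional codes can now feed a much smaller fully connected layer.
```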

Sparse coding has another benefit: it can reduce the influence of noise.

If many input features are invalid or meaningless, they introduce noise and degrade the network's performance.

Sparse coding helps the network filter out this noisy data and keep only the most important features.

Another approach is to use convolutional neural networks.

Convolutional neural networks are designed for fields such as image and speech processing and can process the input hierarchically.

The core idea of a convolutional neural network is to apply a convolution to the input and feed the result to the next layer.
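A minimal example of this idea, assuming PyTorch and arbitrary channel sizes, is shown below: the output of one convolution is passed as the input of the next layer.

```python
import torch
import torch.nn as nn

layer1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
layer2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

x = torch.randn(1, 3, 64, 64)            # one RGB image, 64x64 (toy input)
h = torch.relu(layer1(x))                # first convolution
y = torch.relu(layer2(h))                # its output feeds the next layer
print(y.shape)                           # torch.Size([1, 32, 64, 64])
```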

The Basic Principle and Applications of Sparse Coding

Sparse coding is a technique commonly used in information processing: it compresses the representation of an input signal so that the data can be stored and transmitted efficiently.

This article introduces the basic principle of sparse coding and its applications.

1. The basic principle of sparse coding. Sparse coding exploits the redundancy of the signal to represent the input as a sparse vector.

In sparse coding, the input signal can be seen as a linear combination of a set of basis vectors.

The goal of sparse coding is to find an optimal set of basis vectors such that the representation of the input signal under these bases is as sparse as possible.

The sparse coding process can be divided into two steps: dictionary learning and signal reconstruction.

First, a dictionary-learning algorithm learns a set of basis vectors from the training data; these basis vectors can be used to represent the input signal.

Then, in the signal-reconstruction stage, the learned basis vectors are used to reconstruct the input signal, which yields the sparse representation.
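The two steps can be illustrated with scikit-learn on toy data; the dictionary size, penalty and random signals below are assumptions, not values from the text.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(1)
signals = rng.standard_normal((200, 64))          # toy training signals

# Step 1: dictionary learning -- learn basis vectors from the training data
dico = MiniBatchDictionaryLearning(n_components=32, alpha=1.0, random_state=1)
codes = dico.fit_transform(signals)               # sparse codes, shape (200, 32)
D = dico.components_                              # learned dictionary, shape (32, 64)

# Step 2: signal reconstruction -- rebuild the signals from the sparse codes
reconstruction = codes @ D
err = np.linalg.norm(signals - reconstruction) / np.linalg.norm(signals)
print(f"relative reconstruction error: {err:.3f}, "
      f"avg nonzeros per code: {np.count_nonzero(codes, axis=1).mean():.1f}")
```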

2. Applications of sparse coding. Sparse coding is widely used in many fields.

Some common application scenarios are described below.

1. Image processing. Sparse coding plays an important role in image processing.

By representing images sparsely, image compression and denoising can be achieved.

In image compression, sparse coding effectively reduces the storage space of an image and improves its transmission efficiency.

In image denoising, sparse coding represents the noisy signal as a sparse vector and thereby suppresses the image noise.

2. Speech recognition. Sparse coding is also widely used in speech recognition.

A sparse representation of the speech signal extracts its key features, enabling recognition and analysis of the speech.

Sparse coding effectively reduces the dimensionality of the speech signal, cuts the amount of computation, and improves recognition accuracy.

3. Data compression. Sparse coding has important value in the field of data compression.

A sparse representation of the data removes its redundant information and achieves efficient compression.

Sparse coding represents high-dimensional data as low-dimensional sparse vectors, greatly reducing storage space and transmission bandwidth.

4. Machine learning. Sparse coding is also widely used in machine learning.

A sparse representation of the input data extracts its important features, enabling classification and prediction.

With the learned basis vectors, sparse coding maps the input data into a low-dimensional sparse space, which reduces the feature dimensionality and improves the accuracy of classification and prediction.

The Difference and Connection Between Sparse Coding and Sparse Representation

Sparse coding and sparse representation are techniques commonly used in machine learning; they play an important role in data processing and feature extraction.

Although they have some similarities, there are also differences and connections between them in practice.

First, both sparse coding and sparse representation were proposed for handling high-dimensional data.

High-dimensional data often contains a large amount of redundancy and noise, which makes data processing difficult.

By compressing the data and extracting the useful information in it, sparse coding and sparse representation reduce the influence of redundancy and noise.

Sparse coding is a data-compression technique: it finds a set of basis vectors and represents the original data as a linear combination of these basis vectors.

Unlike a conventional basis representation, sparse coding requires the coefficients of the linear combination to be sparse, i.e., most of the coefficients are zero.

This effectively reduces the dimensionality of the data and extracts its most important features.

The key to sparse coding is how to choose suitable basis vectors and a suitable sparse representation method.

Common sparse representation methods include L1 regularization, the L0 norm, and dictionary-learning-based methods.

With these methods, the original data can be represented as a sparse vector in which only a few coefficients are non-zero.
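A small, hypothetical example of the L1-regularization route: fitting a Lasso model against a random dictionary drives most coefficients to exactly zero, leaving a sparse coefficient vector. The dictionary, signal and penalty value are made up.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
D = rng.standard_normal((50, 200))       # 200 candidate basis vectors (columns)
true_coef = np.zeros(200)
true_coef[[3, 70, 150]] = [2.0, -1.5, 1.0]
x = D @ true_coef + 0.01 * rng.standard_normal(50)   # observed signal

# The L1 penalty drives most coefficients to exactly zero
lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
lasso.fit(D, x)
print("non-zero coefficients:", np.count_nonzero(lasso.coef_), "of", lasso.coef_.size)
```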

Sparse representation is a feature-extraction technique: it selects a set of basis vectors that best represent the original data and expresses the data as a linear combination of these basis vectors.

Unlike sparse coding, sparse representation does not require the coefficients of the linear combination to be sparse; they can take arbitrary values.

The goal of sparse representation is to find a set of basis vectors such that the data represented with them is as close as possible to the original data.

The key is again how to choose suitable basis vectors and representation methods.

Common sparse representation methods include principal component analysis (PCA), independent component analysis (ICA) and singular value decomposition (SVD).

With these methods, the original data can be represented as a low-dimensional vector in which every dimension is an important feature of the original data.
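As a tiny illustration of this kind of low-dimensional representation, the following uses PCA from scikit-learn on made-up data; the sizes are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 50))        # 300 samples, 50 features (toy data)
Z = PCA(n_components=5).fit_transform(X)  # 5-dimensional representation
print(Z.shape)                            # (300, 5)
```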

Sparse coding and sparse representation are connected in practice.

First, both can be used for dimensionality reduction and feature extraction.

By choosing suitable basis vectors and representation methods, high-dimensional data can be represented as low-dimensional vectors, which reduces the cost of computation and storage.

Second, both can be used in signal processing and image processing.

Through sparse coding and sparse representation, the important information in signals and images can be extracted while noise and redundancy are removed, improving the quality of the signals and images.

Nevertheless, sparse coding and sparse representation also differ in some respects.

Sparse Feature Processing Methods

Sparse feature processing refers to sparsifying the features of a dataset so as to reduce dimensionality and redundant information and improve the efficiency and accuracy of a model.

Commonly used sparse feature processing methods include:
1. Feature selection based on L1 regularization: models such as Lasso are used to screen the features, keeping those that have an important influence on the target variable and removing those that do not.

2. Feature hashing: a hash function maps the features into a fixed-size space, avoiding the curse of dimensionality and the computational cost caused by too many features (see the sketch after this list).

3. Bag-of-words models: for text data, the text is converted into a vector of word counts while stop words and other irrelevant terms are removed, improving the accuracy of text classification.

4. Feature crosses: several features are combined into new feature vectors, improving the expressiveness and generalization ability of the model.
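The following is a small, hypothetical illustration of method 2 (feature hashing) using scikit-learn's FeatureHasher; the feature names and output dimension are invented for the example.

```python
from sklearn.feature_extraction import FeatureHasher

# Hash arbitrary feature names into a fixed 16-dimensional space, so the model
# size no longer grows with the number of distinct features.
hasher = FeatureHasher(n_features=16, input_type="dict")
samples = [
    {"user_id=42": 1, "city=Beijing": 1, "clicks": 3},
    {"user_id=7": 1, "city=Shanghai": 1, "clicks": 1},
]
X = hasher.transform(samples)             # sparse matrix of shape (2, 16)
print(X.shape, X.nnz)
```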

Sparse feature processing methods are widely applied in machine learning and have important value in areas such as recommender systems, natural language processing and image recognition.


Methods for Handling Sparse Data in Deep Learning Models

Deep learning is a machine-learning approach that performs pattern recognition and decision making with multi-layer neural networks that imitate the structure of the human brain.

In deep learning, data quality is crucial to model performance.

However, the data in many practical applications is sparse, i.e., most feature values are zero.

Sparse data is common in deep learning because, for example in natural language processing and recommender systems, most features do not occur together.

As a result the input dimensionality is very high while only a few features carry real meaning.

To deal with the sparse-data problem, researchers have proposed several methods.

1. Sparse data representation. Representation methods are the most basic way of handling sparse data.

Their main idea is to convert sparse data into dense data through an appropriate encoding.

Common sparse data representations include One-Hot encoding and TF-IDF.

Taking One-Hot encoding as an example, each feature is encoded as a binary vector whose length equals the dimensionality of the feature space.

A feature is 1 only at its corresponding position and 0 everywhere else, so the sparse data is encoded into dense form.

Using the dense data can speed up training and improve model performance.
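A minimal illustration of One-Hot encoding with scikit-learn is shown below; the categorical values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])
encoder = OneHotEncoder()
X = encoder.fit_transform(colors).toarray()   # each row has exactly one 1
print(encoder.categories_)                    # [array(['blue', 'green', 'red'], ...)]
print(X)
```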

2. Feature selection. Feature selection is another method for dealing with sparse data.

Its main idea is to select, from the original data, the subset of features that is most useful for the target task.

Reducing the feature dimensionality improves the efficiency and performance of the model.

Common feature-selection methods include the correlation coefficient method, the chi-square test and mutual information.

These methods evaluate the correlation between each feature and the target, so that the features most relevant to the target task can be selected.

3. Embedded methods. Embedded selection combines feature selection with model training.

During training, an embedded method automatically selects the features relevant to the target task and incorporates them into the model.

Common embedded methods include L1 regularization and decision trees.

Taking L1 regularization as an example, an L1 penalty term is added to the model's objective function, encouraging the model to select fewer features and thereby achieving feature selection.
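A small, hypothetical sketch of this embedded selection: an L1-penalized logistic regression is wrapped in scikit-learn's SelectFromModel so that only features with non-zero weights are kept. The dataset and penalty strength are invented.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
print("selected features:", selector.get_support().sum(), "of", X.shape[1])
```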

4. Feature imputation. Feature imputation fills in the missing feature values of sparse data by prediction or estimation.

Sparsity and Feature Extraction Methods

Sparsity and feature extraction are two very important concepts in machine learning.

A sparse representation means that a dataset contains a large number of low-dimensional representations while high-dimensional representations are rare or almost absent.

Feature extraction refers to extracting useful features from the raw data so that a model can be built.

In deep learning, sparse representation and feature extraction depend on each other, because deep models usually need a large number of high-dimensional features for modeling.

Sparse representation methods include pruning, quantization and sparse coding.

Pruning reduces the feature dimensionality by deleting redundant features.

Quantization maps high-dimensional features into a lower-dimensional space so that they can be represented more effectively.

Sparse coding compresses the original data with a low-dimensional representation, reducing the bandwidth needed for storage and transmission.

Feature extraction methods include traditional methods and deep-learning methods.

Traditional feature extraction includes statistical analysis, feature engineering and so on.

Deep-learning feature extraction includes convolutional neural networks, recurrent neural networks, autoencoders and so on.

Deep-learning feature extraction is efficient, accurate and highly interpretable, so it is increasingly popular.
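As one concrete (and simplified) example of deep feature extraction, the sketch below defines a small autoencoder whose bottleneck code serves as the extracted feature and whose training objective is the reconstruction error; all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=100, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 32), nn.ReLU(), nn.Linear(32, in_dim))

    def forward(self, x):
        code = self.encoder(x)            # the extracted low-dimensional feature
        return self.decoder(code), code

model = AutoEncoder()
x = torch.randn(16, 100)
recon, features = model(x)
loss = nn.functional.mse_loss(recon, x)   # train by minimizing reconstruction error
print(features.shape)                     # torch.Size([16, 8])
```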

Besides sparse representation and feature extraction, there are other machine-learning methods, such as ensemble learning, active learning and transfer learning, which can also be applied to sparse data and feature extraction.

Sparse representation and feature extraction are two very important concepts in machine learning.

By choosing suitable sparse representation and feature extraction methods, sparse data and low-dimensional features can be handled better, improving the performance and accuracy of the model.

As machine learning continues to develop, sparse representation and feature extraction methods will be widely applied in more and more fields.

Sparse Discriminant Analysis

Abstract: Manifold-embedding dimensionality-reduction methods construct the neighborhood graph in the original high-dimensional space, where it tends to work poorly, and it is difficult to assign appropriate values to the neighborhood size and the heat-kernel parameter involved in graph construction. To address these problems, a sparse discriminant analysis algorithm (SEDA) is proposed.

First, SEDA builds a sparse graph by sparse representation to preserve the global information and geometric structure of the data, overcoming the shortcomings of manifold-embedding methods; second, with sparsity preservation as a regularization term and the Fisher discriminant criterion, the optimal projection can be obtained.

Experimental results on a set of high-dimensional datasets show that SEDA is a very effective semi-supervised dimensionality-reduction method.

Key words: discriminant analysis; sparse representation; neighborhood graph; sparse graph

Sparse Discriminant Analysis
CHEN Xiao-dong, LIN Huan-xiang (1. School of Information and Engineering, Zhejiang Radio and Television University, Hangzhou Zhejiang 310030, China; 2. School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou Zhejiang 310023, China)
Abstract: Methods for manifold embedding have the following issues: on one hand, the neighborhood graph is constructed in the high dimensionality of the original space, where it tends to work poorly; on the other hand, appropriate values for the neighborhood size and the heat kernel parameter involved in graph construction are generally difficult to assign. To address these problems, a new semi-supervised dimensionality reduction algorithm called sparse discriminant analysis (SEDA) is proposed. Firstly, SEDA sets up a sparse graph to preserve the global information and geometric structure of the data based on sparse representation. Secondly, it applies both the sparse graph and the Fisher criterion to seek the optimal projection. Experimental results on a broad range of data sets show that SEDA is superior to many popular dimensionality reduction methods.
Key words: discriminant analysis; sparse representation; neighborhood graph; sparse graph

0 Introduction
In applications such as information retrieval, text classification, image processing and biological computing, the data we face are high-dimensional.
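The sparse graph that SEDA relies on can be sketched as follows: each sample is reconstructed from the remaining samples under an L1 penalty and the resulting coefficients are used as graph weights. This is a generic sparse-representation graph with assumed parameter values and may differ from SEDA's exact formulation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_graph(X, alpha=0.05):
    """Toy sparse graph: row i holds the L1 reconstruction weights of sample i."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(X, i, axis=0)                 # all samples except i
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(others.T, X[i])                        # X[i] ~ others.T @ w
        W[i, np.arange(n) != i] = lasso.coef_
    return np.abs(W)
```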


Sparse Multi-Modal Hashing
Fei Wu, Zhou Yu, Yi Yang, Siliang Tang, Yin Zhang, and Yueting Zhuang

Abstract—Learning hash functions across heterogenous high-dimensional features is very desirable for many applications involving multi-modal data objects. In this paper, we propose an approach to obtain the sparse codesets for the data objects across different modalities via joint multi-modal dictionary learning, which we call sparse multi-modal hashing. In the proposed approach, both intra-modality similarity and inter-modality similarity are first modeled by a hypergraph, then multi-modal dictionaries are jointly learned by Hypergraph Laplacian sparse coding. Based on the learned dictionaries, the sparse codeset of each data object is acquired and conducted for multi-modal approximate nearest neighbor retrieval using a sensitive Jaccard metric. The experimental results show that the proposed approach outperforms other methods in terms of mAP and Percentage on two real-world data sets.

Index Terms—Dictionary learning, multi-modal hashing, sparse coding.

Manuscript received April 19, 2013; revised July 17, 2013; accepted September 08, 2013. Date of publication November 14, 2013; date of current version January 15, 2014. This work was supported by the 973 Program (No. 2012CB316400), NSFC (61070068, 90920303, 61103099), the 863 program (2012AA012505), the Chinese Knowledge Center of Engineering Science and Technology (CKCEST), and the China Academic Digital Associative Library (CADAL). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Chong-Wah Ngo. F. Wu, Z. Yu, S. Tang, Y. Zhang, and Y. Zhuang are with the College of Computer Science and Technology, Zhejiang University, Hangzhou, China. Y. Yang is with ITEE, University of Queensland, Brisbane, Australia. Digital Object Identifier 10.1109/TMM.2013.2291214

I. INTRODUCTION
Similarity search, a.k.a. nearest neighbor (NN) search, is a fundamental problem and has enjoyed great success in many applications of data mining, database, and information retrieval. With the explosive growth of high-dimensional data, e.g., the images and videos on the web, there is an emerging need of NN search on high-dimensional feature spaces. The problem of NN search can be described as follows: given a query data object, find the top nearest neighbors to the query from a target data set.
The simplest way to solve the NN search problem is the brute-force linear search. However, this becomes prohibitively expensive when the number of retrieved target data objects is very large. To speed up the process of finding relevant data objects to a query, indexing techniques are necessarily conducted to organize target data objects. However, some studies pointed out that many indexing methods have an exponential dependence (in space or time or both) upon the number of dimensions, and even the simple brute-force linear search method may be more efficient than an index-based search in high-dimensional settings [1].
A promising way to speed up the similarity search is the hashing technique. It makes a tradeoff between accuracy and efficiency and relaxes the nearest neighbor search to approximate nearest neighbor (ANN) search. The principle of the hashing method is to map the high-dimensional data objects into compact hash codes so that similar data objects have the same or similar hash codes.
The key of the hashing-based ANN search methods is the design of the hash functions. The most
well-known one is Locality Sensitive Hashing (LSH) [2], [3], which uses random projections to obtain the hash functions. However, due to the intrinsical property of random projection, to guarantee good retrieval performance, LSH usually needs a quite long hash code and hundreds of hash tables. To make the hash code more compact, several data-dependent learning based methods are proposed. Weiss et al. propose Spectral Hashing (SH) [4], which assumes the uniform distribution of data objects in the training set and uses eigenfunctions to obtain the hash functions. Compared with LSH, SH achieves a better performance since the learned hash functions capture the manifold of the data set. Since then, many extensions of SH have been proposed, such as [5]-[8].
Nowadays, many real-world applications involve multi-modal data objects, where information inherently consists of data objects with different modalities, such as a web image with loosely related narrative text descriptions, and a historic news report with paired text and images. How to learn the latent correlation of different modalities and devise a cross-media retrieval algorithm is a very worthwhile research direction to be explored. Here, cross-media retrieval means giving a query from modality A (e.g., image modality) and returning the most relevant results from modality B (e.g., textual modality). The basic idea of cross-media retrieval is to discover the correlations between paired multi-modal data, and some techniques such as canonical correlation analysis [9], manifold learning [10], [11], and structural learning [12] can be exploited. However, most of them do not mainly focus on the efficiency of retrieval and may be infeasible for large scale data sets. Therefore, devising one multi-modal hashing (also known as cross-media hashing) algorithm for fast cross-media retrieval is of great importance. In the past years, only limited attempts have been made, e.g., Cross Modal Similarity Sensitive Hashing (CMSSH) [13], Cross View Hashing (CVH) [14] and Multi Latent Binary Embedding (MLBE) [15].
How to faithfully preserve both intra-modality similarity and inter-modality similarity with compact codes is fundamental for the hashing of multi-modal data objects. A natural question to ask is whether we can design an algorithm for hashing of multi-modal data objects with the two following aspects: 1) the formulation of the hash function may well utilize the similarities of intra-modality and inter-modality and achieve a great coding power; and 2) the similarity of obtained hash codes should be computed efficiently.
Motivated by the fact that dictionary learning (DL) methods have the intrinsic power of dealing with heterogeneous features by generating different dictionaries for multi-modal data objects [16], [17], this paper is dedicated to developing a hashing method for multi-modal data objects based on a multi-modal dictionary learning method, i.e., simultaneous generation of the sparse coefficients for the data objects from multiple modalities (e.g., images and texts), which we call sparse multi-modal hashing. The approach is formulated by coupling the multi-modal dictionary learning (in terms of approximate reconstruction of each data object with a weighted linear combination of a small number of "basis vectors" or "dictionary atoms") and a regularized hypergraph penalty (in terms of the modeling of multi-modal correlation).
Different from other hashing approaches,attempts to activate the most relevant component and induce a compact codeset for each data from its corresponding sparse coefficients. This characteristic enables all hashing bits to be fully utilized since each hashing bit only needs to be effective for certain data points.Although our codeset-based is different from the hamming embedding hashing methods,they actually bear some resemblance:generating a compact representation of each high-dimensional data object and performing an efficient ANN search.To make both the similarities of intra-modality and inter-modality well preserved by the compact codesets of data ob-jects,hypergraph is utilized to model the correlations between multi-modal data and enforced as a regularizer during multi-modal dictionary learning.As a result,the sparse coefficients of each data object can faithfully encode the intra and inter similar-ities of each data object with other(homogeneous or heteroge-neous)data objects instead of purely reconstructive one.That is to say,similar data objects will have similar sparse coefficients.A hashing scheme is conducted to activate those informative (relevant)component indices of sparse coefficients of each data object.In this way,we can obtain a compact codeset of each data object,which provides a more compact and interpretable representation of each data object.To the best of our knowledge,only one existing method, called Robust Sparse Hashing(RSH)[18],adopts the idea of hashing with dictionary learning.The difference between RSH and our proposed method is that RSH is limited to the uni-modal data objects while ours is performed to the multi-modal data objects.Furthermore,our method also inte-grates some other aforementioned appealing characteristics that make the generated sparse codesets more applicable for ANN retrieval of multi-modal data objects.The main contributions of are two-fold:•The intra-similarity and inter-similarity between multi-modal data objects are explicitly well leveraged when learning the multi-modal dictionaries by a hyper-graph penalty which further improves the performance of ANN retrieval of multi-modal data objects.•Since a compact codeset is learned for each data ob-ject rather than the traditional hamming binary codes in other hashing approaches,an appropriate distance metric, namely the sensitive Jaccard distance,is employed for efficient ANN search here.The rest of the paper is organized as follows:In Section II,we review the related work of multi-modal hashing.In Section III, we give out the overview of our proposed.The optimiza-tion details are demonstrated in Section IV.Experimental results and comparisons on two real-world data sets are demonstrated in Section VI.Finally,the conclusion and future work are given in Section VII.II.R ELATED W ORKFastfinding the similar data objects to a given query from a large scale database is critical to content-based information retrieval.Given a query,the naive solution to accuratelyfind the examples that are most similar to the query is to search over all of the data objects in database and sort them according to their similarities to the query.However,this becomes prohibitively expensive when the scale of the database is very large,therefore indexing techniques are required to accelerate the efficiency of the retrieval[19],[20].In recent years,hashing-based methods for large-scale simi-larity search have sparked considerable research interests in the data mining and machine learning communities.For example, 
Locality Sensitive Hashing(LSH)and its variations have been proposed as indexing approaches for ANN search[21],[22]. However,LSH could be unstable and may lead to extremely bad result for a small number of hash bits and hash tables.Therefore, the number of hash bits and hash tables required may be large in some cases in order to achieve a good performance in LSH. Unlike those approaches which randomly project the input data objects into an embedding space such as LSH,some machine learning(data-aware)approaches were recently im-plemented to generate more accurate hash codes,such as Semantic Hashing[23],Spectral Hashing(SH)[4],Self-taught Hashing[5],Spline Regression Hashing(SRH)[24],Random Maximum Margin Hashing[25],LDAHash[26],Bit Selection Hashing[27],Compact Hyperplane Hashing[28],etc.All of these approaches attempt to elaborate appropriate hash func-tions to transform original high-dimensional data objects into compact binary codes.Besides,quantization method such as Iterative quantization[29],Product quantization[30],etc.,can also be adopted to conduct hashing alike efficient retrieval. The approaches mentioned above explicitly or implic-itly focus on the hashing of data objects with homogeneous features.Nevertheless,in the real world,we can extract het-erogenous features from each data object.Taken the images as examples,we can extract many of visual features from images such as global features(color,shape and texture)or local features(SIFT,Shape Context and GLOH(Gradient Location and Orientation Histogram).Therefore,we can take each kind of visual features as a view of images.Since different views (visual features)have their own specific statistical properties, different visual features may have different discriminative powers to characterize one given image.In computer vision and multimedia research,some approaches have shown that leveraging information contained in multiple views potentially has an advantage over only using a single view[31],[32]. 
Multiple Feature Hashing (MFH) is therefore proposed in [33], [34] to learn the binary code of each data object with heterogenous features. MFH can preserve the local structures of each homogenous feature and also globally consider the structures of all the heterogenous features.

Fig. 1. The algorithmic flowchart of our proposed approach. For the sake of illustrative simplicity, we assume only two kinds of data objects (i.e., images and texts) here. A hypergraph is first constructed to model the correlations between multi-modal data objects, then the multi-modal dictionaries are jointly learned to obtain one image dictionary and one text dictionary respectively. Each data object can be succinctly represented using a limited number of corresponding dictionary atoms and the corresponding sparse coefficients. Finally, the hashing scheme is conducted to identify the significantly informative components (i.e., the sparse codes with large coefficients). The selected component indices are used to construct a sparse codeset for each data object. We can observe that the sparse codesets well preserve both intra-modality similarity and inter-modality similarity. For example, two "dinosaur" images have the same sparse codeset, and two "dinosaur" images have similar sparse codesets to their relevant text (dinosaur, ancient and fossil, etc.). On the contrary, two "dinosaur" images have apparently different sparse codesets from their irrelevant text (sport, football, etc.).

In this paper, we focus on performing cross-modality similarity retrieval for multi-modal data objects. In recent years, some hashing approaches for multi-modal data objects have been proposed, such as CMSSH [13], CVH [14] and MLBE [15].
The problem of multi-modal hashing has been initiated by Bronstein et al. in CMSSH [13]. Specifically, given two kinds of data objects, CMSSH learns two groups of hash functions to ensure that if two data objects (with different modalities) are relevant, their corresponding hash codes are similar, and otherwise dissimilar. However, CMSSH only preserves the inter-modality similarity but ignores the intra-modality similarity.
Kumar et al. extend Spectral Hashing [4] to the multi-modal scenario and propose CVH [14]. CVH attempts to generate the hash codes by minimizing the distance of hash codes for similar data objects and maximizing the distance for dissimilar data objects. The inter-view and intra-view similarities are well conducted in CVH.
Zhen et al. propose a probabilistic latent factor model, called multi modal latent binary embedding (MLBE) in [15], to learn hash functions for multi-modal retrieval. MLBE employs a generative model to encode the intra-similarity and inter-similarity of data objects across multiple modalities. Based on maximum a posteriori estimation, the binary latent factors are efficiently obtained and then taken as hash codes in MLBE.
As stated before, this paper is interested in sparse multi-modal hashing by multi-modal dictionary learning. Although our proposed method bears some resemblance to robust sparse hashing (RSH) [18], which adopts the idea of hashing with dictionary learning, we extend the idea from uni-modal data objects to multi-modal data objects.

III. THE ALGORITHM OVERVIEW
In this section, we introduce the details of the proposed approach. Fig. 1 illustrates its algorithmic flowchart. For the sake of illustrative simplicity, we assume only two kinds of data objects (i.e., images and texts) are available in Fig. 1. The proposed approach mainly consists of the following four parts:
• Modeling of multi-modal correlation: In
Fig.1,different kinds of data objects can be uniformly viewed as vertices in the hypergraph.The homogeneous hyperedges are uti-lized here to connect similar homogenous vertices,i.e.,similar images or similar texts.The heterogeneous hyper-edges are conducted to connect image vertices with their similar text vertices.A weight is assigned to each hyper-edge according to its importance in the hypergraph.Hence,the intra-modality similarity and inter-modality similarity are well preserved in the hypergraph.•Multi-modal dictionary learning :The constructed hy-pergraph is employed as a regularizer on multi-modal dictionary learning and we can obtain one image dictionary and one text dictionary jointly.Given a data object with any modality,we can represent the data object as a weighted linear combination of a small number of corresponding “basis vectors”or “dictionary atoms”.Concretely,each date object is succinctly represented using a limited dictionary atoms and a sparse vector of weights (sparse coef ficients).•Out-of-sample extension :As we have obtained the optimal multi-modal dictionaries,we can ef ficiently compute the sparse coef ficients for a new data object from the arbitrary modality using its corresponding dictionary.•Hashing Scheme :Sparse coef ficients of each data object is used to generate its compact representation by the hashing scheme.The hashing scheme encourages those signi ficantly informative component indices (i.e.,indices of the sparse codes with large coef ficients)are selected out (activated).The selected component indices are used to construct a sparse codeset.Here,we use the sparse codeset as the hash code of each data object.Then a sensitive Jaccard distance is employed to ef ficiently perform ANN search.430IEEE TRANSACTIONS ON MULTIMEDIA,VOL.16,NO.2,FEBRUARY2014 In Fig.1,we can observe the sparse codesets well preserveboth the intra-modality similarity and the inter-modality sim-ilarity.For examples,two“dinosaur”images have the samesparse codeset,and two“dinosaur”images have similar sparsecodesets with their relevant text(dinosaur,ancient and fossil, etc.).On the contrary,two“dinosaur”images have apparently different sparse codesets with their irrelevant text(beef,food, lunch,etc.).A.NotationsTo simplify our presentation,we use the special case with two modalities of data objects in this paper,however,our has an inherent extension ability to more than two modalities.We name these two modalities and.Assume that we have two datasets from modality and from modality,respectively.Let,be the two data matrices.and denote the number of data objects in and,and denote the dimensionality of two modalities (usually,).B.The Modeling of Multi-Modal CorrelationTo well model the complex relationship(i.e.,intra-similarity and inter-similarity)between and,we resort to the hyper-graph used in[35],[36].Let denote the weighted hypergraph where is the set of vertices,is the set of hyperedges,is the weights for each hyperedge.The degree of an edge is,that is,the cardinality of.The degree of a vertexis.Let andbe two matrices consisting of the degrees of hyperedges and vertices,respectively,and be the diagonal matrix consisting of the weights of hyperedges.The incidence matrix has its entry if(a special case is)and0otherwise.Specifically,the hypergraph is constructed in this paper as follows:the vertex set.Then,over the vertex set,the intra-similarity and inter-similarity are encoded by homogeneous and heterogeneous hyperedges respectively.Homogenous hyperedges:the intra-modality similarity 
between the data objects in or is encoded in homogeneous hyperedges.To well preserve the manifold structure between the data objects with the same modality,the“probabilistic”hypergraph same as[37]is used here.We take each vertex as a“centroid”vertex and each hyperedge is formed by a centroid and its-nearest neighbours.The incidence matrixis defined as follows[37]:ifotherwise(1)Here denotes the similarity of two where is the average distance between all the vertices.The weight for each hyperedge is computed as follows[37]:(2)This indicates that the closer the near-neighbors are from the “centroid”,the higher weight thehyperedge will have.TABLE IT HE O VERALL I NCIDENCE M ATRIX FOR THE M ODELING OFB OTH I NTER-S IMILARITY AND I NTRA-S IMILARITY IN M ULTI-M ODALD ATA.,,D ENOTE THE C ORRESPONDING E DGE S ETSAND,,D ENOTE THE S UB I NCIDENCE M ATRIXThe probabilistic hypergraph encodes not only the local grouping information,but also the probability that a vertex belongs to a hyperedge.In this way,the correlation between intra-modality data objects is accurately described.In this paper,for dataset,wefirst compute-nearest neigh-bors of each data object in respectively to obtain the hyper-edge set.Then,the weight for each hyperedge is computed by(2).Finally,the incidence matrix,named as,is constructed.A similar procedure is conducted to build hyperedge set and incidence matrixfor dataset.Heterogeneous hyperedges:the inter-modality similarity across the data objects between and can be encoded in hy-peredge set.Here,the elements in denote the inter-modality similarity between the data objects from and. The inter-modality similarity can be obtained according to spe-cific applications.Since the inter-modality similarity is hard to be quantified to real values,same as[15],the binary values are used to indicate the relationships among the data objects across different modalities in this paper,i.e.,if two data objects have s similar relationship,,otherwise,. 
For examples,a web image and its corresponding loosely re-lated narrative text can be regarded to have a similar relation-ship,and their similarity will be set to1.The weight for each is set to the same parameter indicating the signif-icance inter-modality similarity and will tune in the experiments.The overall incidence matrixof our constructed hy-pergraph is shown in Table I.Its worth noting that a hypergraph has the intrinsic power to generically model the high-order relationship among the data objects from more than two modalities and does not add any additional computation overload as the number of modalities increases.C.The Joint Learning of Multi-Modal DictionariesThe modeling of data objects with the linear combinations of a few dictionary atoms from a learned dictionary has been the focus of much recent research[38],[39].The essential chal-lenge to be resolved in dictionary learning is to develop an ef-ficient approach with which each data object can be approxi-mately reconstructed from a“optimal dictionary”with a“sparse coefficients”.Suppose is the learned dictionary from and is the corresponding sparse coefficients of data objects in,is the dimensionality of,is the number of data objects in and is the size of the dictionary respectively.WU et al.:SPARSE MULTI-MODAL HASHING431Similarly,is the learned dictionary from and indi-cates the corresponding sparse coefficients of data objects in, is the dimensionality of,is the number of data objects in and is the size of the dictionary respectively.Here the dictionary sizes of and are set to a same value(i.e.,). The objective function of learning multi-modal dictionaries can be formulated as follows:(3) where is the joint sparse coefficients of and and.shown in(3)is the imposed penalty over the sparse coefficients of all of data objects.Typically,the-norm[40] is conducted as a penalty to explicitly encourage sparsity on each sparse coefficients.The traditional-norm is to the sparse coefficients of each data object independently.Therefore,the similar data objects can be encoded to have totally different sparse coefficients.Since the multi-modal correlation is en-coded into the hypergraph we constructed above,it is natural to preserve the inter-modality and intra-modality similarities be-tween sparse coefficients by imposing a hypergraph penalty in(3)[41].Therefore,the in(3)can be defined as follows:(4) where is the trace norm of a matrix.It can be observed that consists of two parts:thefirst part is a traditional-norm penalty to encourage the sparsity of sparse coefficients,and the second part ensures that for every vertex and in hyperedge,the distance of and is consistent with the distance of their corresponding sparse ficients.Recall that homogenous hyperedges and heteroge-nous hyperedges are both constructed,we therefore capture the intra-similarity or inter-similarity among the data objects during the multi-modal dictionary learning.In(4),is called as the normalized hypergraph laplacian and defined as[42]:(5) where is the identity matrix.According to[41]and also observed in our experiments,the hypergraph laplacian regularized dictionary learning can guar-antee the sparse coefficients corresponding to the data objects connected by a same hyperedge are similar to each other.It is worth noting that the sparsity of sparse coefficients in and is likely different due to the heterogeneity in multi-modal data objects.However,in this paper,the sparsity of sparse coefficients in and is expected to be on the same level. 
Such similar assumption is also enforced in coupled dictionary learning(CDL)for image super-resolution in[43],which sug-gests that one pair of image patches from different domains(low resolution v.s.high resolution)has the same dictionary atoms. As a result,different weights and are set to guarantee and hold alike sparsity.In order to achieve such different level of sparsity over multi-modal data,(4)is reformulated as follows:(6) where is a diagonal matrix defined as:(7) Since is one invariant diagonal matrix,it does not add any additional overload to the solution of(3).The detailed optimiza-tion of is shown in next section.D.Out-of-Sample Extension Using the Learned Dictionaries For each data object in the training data set,we have obtained its corresponding sparse coefficients.To compute the sparse co-efficients of the out-of-sample data objects,the learned dictio-naries are exploited.Assume that given a new data object from modality( from modality is identical),using the,we obtain the sparse coefficients of as follows:(8) An alternative strategy is to add the graph regularization term into the sparse coefficients learning like[41]:(9) where is the number of training samples from modality, is the sparse coefficients for,is the similarity between and.Since the sparse coefficients of the out-of-sample data objects will be online computed which is strictly restricted to the com-putation complexity and the time cost for(9)is about70100 times comparing with(8)even with a small and.In our experiments,the cross-modal retrieval performance using(9) has little improvement over the one with(8),as the correla-tions between intra-modality and inter-modality are faithfully preserved in the learned dictionaries and.Therefore,we use(8)rather than(9)to generate the sparse coefficients for the out-of-sample data objects in our experiments.(8)is a basic lasso problem and can be solved efficiently by LARS method[44].Moreover,LARS method has a benefit that the sparse degree(number of nonzero elements)of the output can be well controlled.This is helpful since we expect the of coefficients on the same level for all of multi-modal data.E.Hashing Scheme of Sparse CodesetsAssume that the sparse coefficients for both data objects in training sets and the out-of-sample data objects are obtained,we tend to devise a hashing scheme to generate the sparse codeset of each data in order to perform efficient ANN search on a large-scale data set.Kong et al.have pointed out that the quantization method of simply thresholding the linear projected data objects into binary hamming codes may lead to serious information loss[45].As。
