15. Keyword-based Correlated Network Computation over Large Social Media


Glossary of Chinese-English Terminology in Artificial Intelligence


Term glossary (English-Chinese): social networks 社会网络 abductive reasoning 溯因推理 action recognition(行为识别)active learning(主动学习)adaptive systems 自适应系统 adverse drugs reactions(药物不良反应)algorithm design and analysis(算法设计与分析)algorithm(算法)artificial intelligence 人工智能 association rule(关联规则)attribute value taxonomy 属性分类规范 autonomous agent 自动代理 autonomous systems 自动系统 background knowledge 背景知识 bayes methods(贝叶斯方法)bayesian inference(贝叶斯推断)belief propagation(置信传播)better understanding 内涵理解 big data 大数据 biological network(生物网络)biological sciences(生物科学)biomedical domain 生物医学领域 biomedical research(生物医学研究)biomedical text(生物医学文本)boltzmann machine(玻尔兹曼机)bootstrapping method 拔靴法 case based reasoning 实例推理 causal models 因果模型 citation matching(引文匹配)classification(分类)classification algorithms(分类算法)cloud computing(云计算)cluster-based retrieval(聚类检索)clustering(聚类)clustering algorithms(聚类算法)cognitive science 认知科学 collaborative filtering(协同过滤)collaborative ontology development 联合本体开发 collaborative ontology engineering 联合本体工程 commonsense knowledge 常识 communication networks(通讯网络)community detection(社区发现)complex data(复杂数据)complex dynamical networks(复杂动态网络)complex network(复杂网络)computational biology(计算生物学)computational complexity(计算复杂性)computational intelligence 智能计算 computational modeling(计算模型)computer animation(计算机动画)computer networks(计算机网络)computer science 计算机科学 concept clustering 概念聚类 concept formation 概念形成 concept learning 概念学习 concept map 概念图 concept model 概念模型 concept modelling 概念模型 conceptual model 概念模型 conditional random field(条件随机场模型)conjunctive queries 合取查询 constrained least squares(约束最小二乘)convex programming(凸规划)convolutional neural networks(卷积神经网络)customer relationship management(客户关系管理)data analysis(数据分析)data center(数据中心)data clustering(数据聚类)data compression(数据压缩)data envelopment analysis(数据包络分析)data fusion 数据融合 data
generation(数据生成)data handling(数据处理)data hierarchy (数据层次)data integration(数据整合)data integrity 数据完整性data intensive computing(数据密集型计算)data management 数据管理data management(数据管理)data management(数据管理)data miningdata mining 数据挖掘data model 数据模型data models(数据模型)data partitioning 数据划分data point(数据点)data privacy(数据隐私)data security(数据安全)data stream(数据流)data streams(数据流)data structure( 数据结构)data structure(数据结构)data visualisation(数据可视化)data visualization 数据可视化data visualization(数据可视化)data warehouse(数据仓库)data warehouses(数据仓库)data warehousing(数据仓库)database management systems(数据库管理系统)database management(数据库管理)date interlinking 日期互联date linking 日期链接Decision analysis(决策分析)decision maker 决策者decision making (决策)decision models 决策模型decision models 决策模型decision rule 决策规则decision support system 决策支持系统decision support systems (决策支持系统) decision tree(决策树)decission tree 决策树deep belief network(深度信念网络)deep learning(深度学习)defult reasoning 默认推理density estimation(密度估计)design methodology 设计方法论dimension reduction(降维) dimensionality reduction(降维)directed graph(有向图)disaster management 灾害管理disastrous event(灾难性事件)discovery(知识发现)dissimilarity (相异性)distributed databases 分布式数据库distributed databases(分布式数据库) distributed query 分布式查询document clustering (文档聚类)domain experts 领域专家domain knowledge 领域知识domain specific language 领域专用语言dynamic databases(动态数据库)dynamic logic 动态逻辑dynamic network(动态网络)dynamic system(动态系统)earth mover's distance(EMD 距离) education 教育efficient algorithm(有效算法)electric commerce 电子商务electronic health records(电子健康档案) entity disambiguation 实体消歧entity recognition 实体识别entity recognition(实体识别)entity resolution 实体解析event detection 事件检测event detection(事件检测)event extraction 事件抽取event identificaton 事件识别exhaustive indexing 完整索引expert system 专家系统expert systems(专家系统)explanation based learning 解释学习factor graph(因子图)feature extraction 特征提取feature extraction(特征提取)feature extraction(特征提取)feature selection (特征选择)feature selection 特征选择feature selection(特征选择)feature space 特征空间first order logic 一阶逻辑formal logic 
形式逻辑formal meaning prepresentation 形式意义表示formal semantics 形式语义formal specification 形式描述frame based system 框为本的系统frequent itemsets(频繁项目集)frequent pattern(频繁模式)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy data mining(模糊数据挖掘)fuzzy logic 模糊逻辑fuzzy set theory(模糊集合论)fuzzy set(模糊集)fuzzy sets 模糊集合fuzzy systems 模糊系统gaussian processes(高斯过程)gene expression data 基因表达数据gene expression(基因表达)generative model(生成模型)generative model(生成模型)genetic algorithm 遗传算法genome wide association study(全基因组关联分析) graph classification(图分类)graph classification(图分类)graph clustering(图聚类)graph data(图数据)graph data(图形数据)graph database 图数据库graph database(图数据库)graph mining(图挖掘)graph mining(图挖掘)graph partitioning 图划分graph query 图查询graph structure(图结构)graph theory(图论)graph theory(图论)graph theory(图论)graph theroy 图论graph visualization(图形可视化)graphical user interface 图形用户界面graphical user interfaces(图形用户界面)health care 卫生保健health care(卫生保健)heterogeneous data source 异构数据源heterogeneous data(异构数据)heterogeneous database 异构数据库heterogeneous information network(异构信息网络) heterogeneous network(异构网络)heterogenous ontology 异构本体heuristic rule 启发式规则hidden markov model(隐马尔可夫模型)hidden markov model(隐马尔可夫模型)hidden markov models(隐马尔可夫模型) hierarchical clustering (层次聚类) homogeneous network(同构网络)human centered computing 人机交互技术human computer interaction 人机交互human interaction 人机交互human robot interaction 人机交互image classification(图像分类)image clustering (图像聚类)image mining( 图像挖掘)image reconstruction(图像重建)image retrieval (图像检索)image segmentation(图像分割)inconsistent ontology 本体不一致incremental learning(增量学习)inductive learning (归纳学习)inference mechanisms 推理机制inference mechanisms(推理机制)inference rule 推理规则information cascades(信息追随)information diffusion(信息扩散)information extraction 信息提取information filtering(信息过滤)information filtering(信息过滤)information integration(信息集成)information network analysis(信息网络分析) information network mining(信息网络挖掘) information network(信息网络)information processing 信息处理information processing 信息处理information 
resource management (信息资源管理) information retrieval models(信息检索模型) information retrieval 信息检索information retrieval(信息检索)information retrieval(信息检索)information science 情报科学information sources 信息源information system( 信息系统)information system(信息系统)information technology(信息技术)information visualization(信息可视化)instance matching 实例匹配intelligent assistant 智能辅助intelligent systems 智能系统interaction network(交互网络)interactive visualization(交互式可视化)kernel function(核函数)kernel operator (核算子)keyword search(关键字检索)knowledege reuse 知识再利用knowledgeknowledgeknowledge acquisitionknowledge base 知识库knowledge based system 知识系统knowledge building 知识建构knowledge capture 知识获取knowledge construction 知识建构knowledge discovery(知识发现)knowledge extraction 知识提取knowledge fusion 知识融合knowledge integrationknowledge management systems 知识管理系统knowledge management 知识管理knowledge management(知识管理)knowledge model 知识模型knowledge reasoningknowledge representationknowledge representation(知识表达) knowledge sharing 知识共享knowledge storageknowledge technology 知识技术knowledge verification 知识验证language model(语言模型)language modeling approach(语言模型方法) large graph(大图)large graph(大图)learning(无监督学习)life science 生命科学linear programming(线性规划)link analysis (链接分析)link prediction(链接预测)link prediction(链接预测)link prediction(链接预测)linked data(关联数据)location based service(基于位置的服务) loclation based services(基于位置的服务) logic programming 逻辑编程logical implication 逻辑蕴涵logistic regression(logistic 回归)machine learning 机器学习machine translation(机器翻译)management system(管理系统)management( 知识管理)manifold learning(流形学习)markov chains 马尔可夫链markov processes(马尔可夫过程)matching function 匹配函数matrix decomposition(矩阵分解)matrix decomposition(矩阵分解)maximum likelihood estimation(最大似然估计)medical research(医学研究)mixture of gaussians(混合高斯模型)mobile computing(移动计算)multi agnet systems 多智能体系统multiagent systems 多智能体系统multimedia 多媒体natural language processing 自然语言处理natural language processing(自然语言处理) nearest neighbor (近邻)network analysis( 网络分析)network analysis(网络分析)network analysis(网络分析)network 
formation(组网)network structure(网络结构)network theory(网络理论)network topology(网络拓扑)network visualization(网络可视化)neural network(神经网络)neural networks (神经网络)neural networks(神经网络)nonlinear dynamics(非线性动力学)nonmonotonic reasoning 非单调推理nonnegative matrix factorization (非负矩阵分解) nonnegative matrix factorization(非负矩阵分解) object detection(目标检测)object oriented 面向对象object recognition(目标识别)object recognition(目标识别)online community(网络社区)online social network(在线社交网络)online social networks(在线社交网络)ontology alignment 本体映射ontology development 本体开发ontology engineering 本体工程ontology evolution 本体演化ontology extraction 本体抽取ontology interoperablity 互用性本体ontology language 本体语言ontology mapping 本体映射ontology matching 本体匹配ontology versioning 本体版本ontology 本体论open government data 政府公开数据opinion analysis(舆情分析)opinion mining(意见挖掘)opinion mining(意见挖掘)outlier detection(孤立点检测)parallel processing(并行处理)patient care(病人医疗护理)pattern classification(模式分类)pattern matching(模式匹配)pattern mining(模式挖掘)pattern recognition 模式识别pattern recognition(模式识别)pattern recognition(模式识别)personal data(个人数据)prediction algorithms(预测算法)predictive model 预测模型predictive models(预测模型)privacy preservation(隐私保护)probabilistic logic(概率逻辑)probabilistic logic(概率逻辑)probabilistic model(概率模型)probabilistic model(概率模型)probability distribution(概率分布)probability distribution(概率分布)project management(项目管理)pruning technique(修剪技术)quality management 质量管理query expansion(查询扩展)query language 查询语言query language(查询语言)query processing(查询处理)query rewrite 查询重写question answering system 问答系统random forest(随机森林)random graph(随机图)random processes(随机过程)random walk(随机游走)range query(范围查询)RDF database 资源描述框架数据库RDF query 资源描述框架查询RDF repository 资源描述框架存储库RDF storge 资源描述框架存储real time(实时)recommender system(推荐系统)recommender system(推荐系统)recommender systems 推荐系统recommender systems(推荐系统)record linkage 记录链接recurrent neural network(递归神经网络) regression(回归)reinforcement learning 强化学习reinforcement learning(强化学习)relation extraction 关系抽取relational database 关系数据库relational learning 关系学习relevance 
feedback(相关反馈)resource description framework 资源描述框架 restricted boltzmann machines(受限玻尔兹曼机)retrieval models(检索模型)rough set theory 粗糙集理论 rough set 粗糙集 rule based system 基于规则系统 rule based 基于规则 rule induction(规则归纳)rule learning(规则学习)schema mapping 模式映射 schema matching 模式匹配 scientific domain 科学域 search problems(搜索问题)semantic (web) technology 语义技术 semantic analysis 语义分析 semantic annotation 语义标注 semantic computing 语义计算 semantic integration 语义集成 semantic interpretation 语义解释 semantic model 语义模型 semantic network 语义网络 semantic relatedness 语义相关性 semantic relation learning 语义关系学习 semantic search 语义检索 semantic similarity(语义相似度)semantic web rule language 语义网规则语言 semantic web(语义网)semantic workflow 语义工作流 semi supervised learning(半监督学习)sensor data(传感器数据)sensor networks(传感器网络)sentiment analysis(情感分析)sequential pattern(序列模式)service oriented architecture 面向服务的体系结构 shortest path(最短路径)similar kernel function(相似核函数)similarity measure(相似性度量)similarity relationship(相似关系)similarity search(相似搜索)similarity(相似性)situation aware 情境感知 social behavior(社交行为)social influence(社会影响)social interaction(社交互动)social learning(社会学习)social life networks(社交生活网络)social machine 社交机器 social media(社交媒体)social network analysis(社会网络分析)social network(社交网络)social science(社会科学)social tagging system(社交标签系统)social tagging(社交标签)social web(社交网页)sparse coding(稀疏编码)sparse matrices(稀疏矩阵)sparse representation(稀疏表示)spatial database(空间数据库)spatial reasoning 空间推理 statistical analysis(统计分析)statistical model 统计模型 string matching(串匹配)structural risk minimization(结构风险最小化)structured data 结构化数据 subgraph matching 子图匹配 subspace clustering(子空间聚类)supervised learning(有监督学习)support vector machines(支持向量机)system dynamics(系统动力学)tag recommendation(标签推荐)taxonomy induction 分类体系归纳 temporal logic 时态逻辑 temporal reasoning 时序推理 text analysis(文本分析)text classification(文本分类)text data(文本数据)text mining technique(文本挖掘技术)text mining(文本挖掘)text summarization(文本摘要)thesaurus alignment 同义对齐 time frequency analysis(时频分析)time series analysis(时间序列分析)time series data(时间序列数据)time series(时间序列)topic model(主题模型)topic modeling(主题模型)transfer learning 迁移学习 triple store 三元组存储 uncertainty reasoning 不精确推理 undirected graph(无向图)unified modeling language 统一建模语言 unsupervised learning(无监督学习)upper bound(上界)user behavior(用户行为)user generated content(用户生成内容)utility mining(效用挖掘)visual analytics(可视化分析)visual content(视觉内容)visual representation(视觉表征)visualisation(可视化)visualization technique(可视化技术)visualization tool(可视化工具)web 2.0(网络2.0)web forum(web 论坛)web mining(网络挖掘)web of data 数据网 web ontology language 网络本体语言 web pages(web 页面)web resource 网络资源 web science 万维科学 web search(网络检索)web usage mining(web 使用挖掘)wireless networks 无线网络 world knowledge 世界知识 world wide web(万维网)xml database XML 数据库
Appendix 2: Data Mining knowledge graph (15 second-level nodes and 93 third-level nodes): domain, second-level category, third-level category.

Recognition of Lexical Functions in Academic Texts: Automatic Classification of Keywords Based on BERT Vectorization


Journal of the China Society for Scientific and Technical Information (《情报学报》), Dec. 2020, 39(12): 1320-1329. Recognition of Lexical Functions in Academic Texts: Automatic Classification of Keywords Based on BERT Vectorization. Lu Wei (1,2), Li Pengcheng (1,2), Zhang Guobiao (1,2), Cheng Qikai (1,2) (1. School of Information Management, Wuhan University, Wuhan 430072; 2. Institute for Information Retrieval and Knowledge Mining, Wuhan University, Wuhan 430072). Abstract: As the words or terms that map a paper's full-text subject matter, keywords provide important underlying semantic labels for precise knowledge retrieval and large-scale text computation.

Keywords in current academic texts suffer from unclear intent of use, ambiguous semantic function, and missing contextual information.

To address this, this paper proposes a supervised neural-network method that classifies the semantic functions carried by keywords, enabling the identification of research questions and research methods in academic texts.

Using ten years of journal papers from computer science and related fields as the training corpus, classification models were built with BERT and LSTM. Experiments show that the proposed method outperforms traditional approaches, with overall precision, recall, and F1 of 0.83, 0.87, and 0.85, respectively.

Keywords: academic text; keywords; semantic function recognition; deep learning.
Recognition of Lexical Functions in Academic Texts: Automatic Classification of Keywords Based on BERT Vectorization. Lu Wei (1,2), Li Pengcheng (1,2), Zhang Guobiao (1,2), and Cheng Qikai (1,2) (1. School of Information Management, Wuhan University, Wuhan 430072; 2. Institute for Information Retrieval and Knowledge Mining, Wuhan University, Wuhan 430072). Abstract: As vocabulary or terminology that maps the full-text subject matter in academic texts, keywords can provide important underlying semantic labels for knowledge retrieval and large-scale text computation. At present, there are problems in the use of keywords in academic texts, such as unclear intention, fuzzy semantic function, and lack of context information. Therefore, a neural network method based on supervised learning is proposed to classify the semantic functions carried by keywords, to facilitate the identification of research questions and research methods in academic texts. In this study, journal papers published during a period of 10 years in the field of computer science were used as the training corpus, and the classification model was constructed using BERT and LSTM models. The results show that the proposed method is better than the traditional method: its overall precision, recall, and F1 reached 0.83, 0.87, and 0.85. Key words: academic text; keywords; lexical function recognition; deep learning.
1 Introduction. With the rapid expansion of the research community and the dramatic growth in the volume of academic literature, obtaining knowledge quickly and accurately from the mass of academic papers is increasingly difficult.
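The reported scores follow directly from classification counts. A minimal sketch of how precision, recall, and F1 relate (the counts below are hypothetical, not the paper's data):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one keyword class (illustrative only).
p, r, f1 = precision_recall_f1(tp=83, fp=17, fn=13)
print(round(p, 2), round(r, 2), round(f1, 2))
```

F1 is the harmonic mean of precision and recall, so it always lies between the two.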

A Question Answering System Based on Information Extraction


A Question Answering System Based on Information Extraction. Yu Gen; Li Xiaoge; Liu Rui; Fan Xian; Du Liping. Abstract: After analyzing entity relations and named entities, a hierarchical method is proposed. Questions are divided into entity-relation, named-entity, and keyword types. The answer set is collected according to the entity relation, the entity, and the keywords, and the final answer is obtained by ranking features that include basic features, named-entity matching, and entity-relation matching. Experimental results on NLPCC EVAL 2015 show that the accuracy reaches 54.05% when named entities and entity relations are used, an improvement of 6.1% over using basic features alone. Journal: Computer Engineering and Design (《计算机工程与设计》), 2017, 38(4): 1051-1055. Keywords: question answering system; information extraction; entity relation; named entity; hierarchical method. Authors' affiliation: School of Computer Science, Xi'an University of Posts and Telecommunications, Xi'an 710121. Language: Chinese. CLC: TP391. Current state-of-the-art answer extraction methods in question answering systems [1,2] fall into two categories: semantic-parsing-based and information-extraction-based.
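The hierarchical idea, route a question to one of three layers, then rank candidate answers by weighted features, can be sketched as follows. All routing rules, feature names, and weights here are illustrative placeholders, not the paper's actual implementation:

```python
def classify_question(q):
    """Route a question to one of three layers (toy rules, for illustration)."""
    if " of " in q or "'s " in q:        # crude cue for an entity relation
        return "relation"
    if any(w in q for w in ("who", "where", "which")):
        return "entity"
    return "keyword"

def rank_answers(candidates, weights):
    """Score each candidate by a weighted feature sum; return the best one."""
    def score(c):
        return sum(weights[k] * c["features"].get(k, 0.0) for k in weights)
    return max(candidates, key=score)

# Hypothetical feature weights mirroring the three feature groups in the text.
weights = {"basic": 1.0, "ne_match": 2.0, "rel_match": 3.0}
candidates = [
    {"answer": "Paris",  "features": {"basic": 0.4, "ne_match": 1, "rel_match": 1}},
    {"answer": "London", "features": {"basic": 0.9}},
]
print(classify_question("who wrote Hamlet"))
print(rank_answers(candidates, weights)["answer"])
```

The point of the layered design is that matching an entity relation outweighs basic surface features, which matches the reported accuracy gain over basic features alone.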

EntropyRank: A Keyphrase Extraction Algorithm Based on Topic Entropy


EntropyRank: A Keyphrase Extraction Algorithm Based on Topic Entropy. Yin Hong; Chen Yan; Li Ping. Journal: Journal of Chinese Information Processing (《中文信息学报》), 2019, 33(11): 107-114. Keywords: keyphrase extraction; random walk; topic model; word influence. Authors' affiliation: Center for Intelligence and Networked Systems, School of Computer Science, Southwest Petroleum University, Chengdu 610500. Language: Chinese. CLC: TP391. 0 Introduction. The rapid development of the Internet has caused the number of documents to grow rapidly.

This abundance of documents brings dividends for knowledge acquisition, but it also poses challenges for information access.

Keyphrases reflect a document's topics or main content, helping users quickly understand a document and grasp its key information. The important information they carry is widely used in natural language processing and information retrieval tasks such as clustering, classification, automatic summarization, topic detection, question answering, and knowledge discovery [1-4]. Researchers have therefore long been interested in keyphrase extraction and have proposed many novel methods.

Keyphrase extraction methods have advanced along two main lines: supervised and unsupervised [5].

Supervised methods typically cast keyphrase extraction as a binary classification problem [6-7] and learn a classification model with machine-learning algorithms.

Unsupervised methods commonly cast it as a graph-based ranking problem: a document is converted into a graph, the nodes are ranked with a ranking algorithm [8-9], and the top-K phrases are taken as keyphrases.
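The graph-based ranking scheme described above can be sketched with a minimal TextRank-style random walk: build a co-occurrence graph over a sliding window, then iterate PageRank-style score updates. This is a generic illustration of the family of methods, not the EntropyRank algorithm itself:

```python
from collections import defaultdict

def textrank(words, window=2, d=0.85, iters=50):
    """Rank words by a PageRank-style random walk on a co-occurrence graph."""
    # Build an undirected co-occurrence graph over a sliding window.
    graph = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[i] != words[j]:
                graph[words[i]].add(words[j])
                graph[words[j]].add(words[i])
    # Iterate the standard update: S(w) = (1-d) + d * sum(S(u)/deg(u)).
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[u] / len(graph[u]) for u in graph[w])
                 for w in graph}
    return sorted(score, key=score.get, reverse=True)

doc = "keyphrase extraction ranks candidate phrases extraction ranks phrases".split()
print(textrank(doc))  # words ordered by graph-based score; take the top K
```

Real systems add part-of-speech filtering, phrase chunking, and, in the topic-aware variants discussed below, topic-dependent edge weights and restart probabilities.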

Because supervised methods require large amounts of manually labeled data, and manual annotation is time-consuming and labor-intensive, this study focuses on unsupervised methods.

Traditional graph-based methods score a word by the closeness between words and by the word's own influence within the document.

Currently, closeness between words is mostly measured through semantic relations and co-occurrence frequency, while a word's influence is mostly represented by its positional information and its preference for topics.

As for how to represent a word's topic preference, existing research is limited to the similarity between the word's topic distribution and the document's topic distribution.

To the best of our knowledge, a word's topic distribution is estimated over the whole corpus and does not vary across documents, yet the same word may prefer different topics in different documents; existing methods therefore cannot accurately represent a word's topic preference.

Machine Learning and Data Mining Written-Test and Interview Questions

Decision trees: What is a decision tree? What are some business reasons you might want to use a decision tree model? How do you build a decision tree model? What impurity measures do you know? Describe some of the different splitting rules used by different decision tree algorithms. Is a big brushy tree always good? How would you compare a decision tree with regression? Which is more suitable under different circumstances? What is pruning and why is it important?
Ensemble models:
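The impurity-measure question above has a compact answer in code. A minimal sketch of the two standard node-impurity measures (Gini and entropy) on a list of class labels:

```python
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_k^2."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy: -sum over classes of p_k * log2(p_k)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

print(gini(["a", "b", "a", "b"]))   # maximally impure two-class node
print(entropy(["a", "a", "a", "a"]))  # pure node
```

A split is chosen to maximize the impurity decrease from parent to (weighted) children; CART uses Gini, while ID3/C4.5 use entropy-based information gain.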
Why do we combine multiple trees? What is Random Forest? Why would you prefer it to SVM?
Logistic regression: What is logistic regression? How do we train a logistic regression model? How do we interpret its coefficients?
Support Vector Machines: What is the maximal margin classifier? How can this margin be achieved and why is it beneficial? How do we train an SVM? What about hard SVM and soft SVM? What is a kernel? Explain the kernel trick. Which kernels do you know? How do you choose a kernel?
Neural networks: What is an artificial neural network? How do you train an ANN? What is backpropagation? How does a neural network with three layers (one input layer, one hidden layer, and one output layer) compare to logistic regression? What is deep learning? What is a CNN (convolutional neural network) or an RNN (recurrent neural network)?
Other models: What other models do you know? How can we use a Naive Bayes classifier for categorical features? What if some features are numerical? What are the tradeoffs between different types of classification models, and how do you choose the best one? Compare logistic regression with decision trees and neural networks.
Regularization: What is regularization? Which problem does regularization try to solve? Ans.: It is used to address the overfitting problem; it penalizes your loss function by adding a multiple of an L1 (LASSO) or an L2 (ridge) norm of your weights vector w (the vector of learned parameters in your linear regression). What does it mean (practically) for a design matrix to be "ill-conditioned"? When might you want to use ridge regression instead of traditional linear regression? What is the difference between L1 and L2 regularization? Why (geometrically) does LASSO produce solutions with zero-valued coefficients (as opposed to ridge)?
Dimensionality reduction: What is the purpose of dimensionality reduction and why do we need it?
Are dimensionality reduction techniques supervised or not? Are all of them (un)supervised? What ways of reducing dimensionality do you know? Is feature selection a dimensionality reduction technique? What is the difference between feature selection and feature extraction? Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
Clustering: Why do you need to use cluster analysis? Give examples of some cluster analysis methods. Differentiate between partitioning methods and hierarchical methods. Explain K-Means and its objective. How do you select K for K-Means?
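The K-Means questions above can be answered concretely: the objective is the within-cluster sum of squared distances, minimized by alternating assignment and centroid updates. A minimal from-scratch sketch (toy data, fixed iteration count):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on tuples: alternate nearest-center assignment and
    centroid recomputation, minimizing within-cluster squared error."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [tuple(sum(coord) / len(coord) for coord in zip(*cl)) if cl
                   else centers[i] for i, cl in enumerate(clusters)]
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(sorted(kmeans(pts, k=2)))  # two centers, one per obvious cluster
```

Selecting K is usually done by the elbow method on this objective, or by silhouette scores; in practice a k-means++ initialization and multiple restarts are used instead of a single random draw.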

A Fast Bayesian Algorithm for Gene Network Construction


A Fast Bayesian Algorithm for Gene Network Construction. Liu Fei. Abstract: Inferring gene regulatory networks (GRNs) is a major challenge in computational biology. Over the past decades, numerous computational approaches have been introduced for inferring GRNs. Bayesian network methods cannot handle large-scale networks because of their high computational complexity, while information-theory-based methods suffer from false-positive/false-negative problems. To overcome these limitations, we present a novel algorithm. The algorithm first uses sequential conditional mutual information to construct initial regulatory sub-networks; then, using topological prior knowledge in the form of a restriction on the maximum number of parents per gene, it finds the optimal network structure with a Bayesian method. The algorithm was tested on realistic biological networks and on in silico networks of different sizes and topologies, and it outperforms other state-of-the-art methods. The results indicate that the method not only effectively reduces the computational cost, owing to the much smaller sizes of the local GRNs, but also considerably improves the precision of network inference. Journal: New Technology & New Process (《新技术新工艺》), 2017(5): 37-40. Keywords: bioinformatics; Bayesian network; gene regulatory network; conditional mutual information. Author's affiliation: Institute of Physics and Optoelectronic Technology, Baoji University of Arts and Sciences, Baoji 721016. Language: Chinese. CLC: TP391. Reverse-engineering the topology of gene regulatory networks from large-scale gene expression data is an important research area in systems biology.
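The conditional mutual information used to build the initial sub-networks has a direct empirical estimate from discretized expression data. A minimal sketch (plug-in estimator on toy samples; not the paper's full sequential procedure):

```python
from collections import Counter
from math import log2

def cmi(xs, ys, zs):
    """Plug-in estimate of conditional mutual information I(X;Y|Z)
    from three aligned lists of discrete samples."""
    n = len(xs)
    pxyz = Counter(zip(xs, ys, zs))
    pxz = Counter(zip(xs, zs))
    pyz = Counter(zip(ys, zs))
    pz = Counter(zs)
    total = 0.0
    for (x, y, z), c in pxyz.items():
        # I(X;Y|Z) = sum p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]
        total += (c / n) * log2((c / n) * (pz[z] / n) /
                                ((pxz[(x, z)] / n) * (pyz[(y, z)] / n)))
    return total

# X identical to itself given Z: I(X;X|Z) reduces to H(X|Z), which is high
# here; near-zero CMI would suggest no direct regulatory link given Z.
x = [0, 0, 1, 1, 0, 1, 0, 1]
z = [0, 1, 0, 1, 0, 1, 0, 1]
print(round(cmi(x, x, z), 3))
```

In a CMI-based GRN pipeline, an edge between two genes is kept only if their CMI given already-selected regulators stays above a threshold, which is what prunes the false positives mentioned above.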

Convolutional Neural Networks and Machine Learning: English-Chinese Translation of a Foreign-Language Paper (2020)

English original: Prediction of composite microstructure stress-strain curves using convolutional neural networks. Charles Yang, Youngsoo Kim, Seunghwa Ryu, Grace Gu.
Abstract: Stress-strain curves are an important representation of a material's mechanical properties, from which important properties such as elastic modulus, strength, and toughness are defined. However, generating stress-strain curves from numerical methods such as the finite element method (FEM) is computationally intensive, especially when considering the entire failure path for a material. As a result, it is difficult to perform high-throughput computational design of materials with large design spaces, especially when considering mechanical responses beyond the elastic limit. In this work, a combination of principal component analysis (PCA) and convolutional neural networks (CNN) is used to predict the entire stress-strain behavior of binary composites evaluated over the entire failure path, motivated by the significantly faster inference speed of empirical models. We show that PCA transforms the stress-strain curves into an effective latent space by visualizing the eigenbasis of PCA. Despite having a dataset of only 10-27% of possible microstructure configurations, the mean absolute error of the prediction is <10% of the range of values in the dataset, when measuring model performance based on derived material descriptors, such as modulus, strength, and toughness. Our study demonstrates the potential to use machine learning to accelerate material design, characterization, and optimization.
Keywords: Machine learning, Convolutional neural networks, Mechanical properties, Microstructure, Computational mechanics.
Introduction. Understanding the relationship between structure and property for materials is a seminal problem in material science, with significant applications for designing next-generation materials.
A primary motivating example is designing composite microstructures for load-bearing applications, as composites offer advantageously high specific strength and specific toughness. Recent advancements in additive manufacturing have facilitated the fabrication of complex composite structures, and as a result, a variety of complex designs have been fabricated and tested via 3D-printing methods. While more advanced manufacturing techniques are opening up unprecedented opportunities for advanced materials and novel functionalities, identifying microstructures with desirable properties is a difficult optimization problem. One method of identifying optimal composite designs is by constructing analytical theories. For conventional particulate/fiber-reinforced composites, a variety of homogenization theories have been developed to predict the mechanical properties of composites as a function of volume fraction, aspect ratio, and orientation distribution of reinforcements. Because many natural composites, synthesized via self-assembly processes, have relatively periodic and regular structures, their mechanical properties can be predicted if the load transfer mechanism of a representative unit cell and the role of the self-similar hierarchical structure are understood. However, the applicability of analytical theories is limited in quantitatively predicting composite properties beyond the elastic limit in the presence of defects, because such theories rely on the concept of the representative volume element (RVE), a statistical representation of material properties, whereas strength and failure are determined by the weakest defect in the entire sample domain.
Numerical modeling based on finite element methods (FEM) can complement analytical methods for predicting inelastic properties such as strength and toughness modulus (referred to as toughness, hereafter), which can only be obtained from full stress-strain curves. However, numerical schemes capable of modeling the initiation and propagation of curvilinear cracks, such as the crack phase field model, are computationally expensive and time-consuming, because a very fine mesh is required to accommodate the highly concentrated stress field near the crack tip and the rapid variation of the damage parameter near the diffusive crack surface. Meanwhile, analytical models require significant human effort and domain expertise and fail to generalize to similar domain problems. In order to identify high-performing composites in the midst of large design spaces within realistic time frames, we need models that can rapidly describe the mechanical properties of complex systems and be generalized easily to analogous systems. Machine learning offers the benefit of extremely fast inference times and requires only training data to learn relationships between inputs and outputs, e.g., composite microstructures and their mechanical properties. Machine learning has already been applied to speed up the optimization of several different physical systems, including graphene kirigami cuts, fine-tuning spin qubit parameters, and probe microscopy tuning. Such models do not require significant human intervention or knowledge, learn relationships efficiently relative to the input design space, and can be generalized to different systems. In this paper, we utilize a combination of principal component analysis (PCA) and convolutional neural networks (CNN) to predict the entire stress-strain curve of composite failures beyond the elastic limit. Stress-strain curves are chosen as the model's target because they are difficult to predict given their high dimensionality.
In addition, stress-strain curves are used to derive important material descriptors such as modulus, strength, and toughness. In this sense, predicting stress-strain curves is a more general description of composite properties than any combination of scalar material descriptors. A dataset of 100,000 different composite microstructures and their corresponding stress-strain curves is used to train and evaluate model performance. Due to the high dimensionality of the stress-strain dataset, several dimensionality reduction methods are used, including PCA, featuring a blend of domain understanding and traditional machine learning, to simplify the problem without loss of generality for the model. We will first describe our modeling methodology and the parameters of our finite element method (FEM) used to generate data. Visualizations of the learned PCA latent space are then presented, along with model performance results.
CNN implementation and training. A convolutional neural network was trained to predict this lower-dimensional representation of the stress vector. The input to the CNN was a binary matrix representing the composite design, with 0's corresponding to soft blocks and 1's corresponding to stiff blocks. PCA was implemented with the open-source Python package scikit-learn, using the default hyperparameters. The CNN was implemented using Keras with a TensorFlow backend. The batch size for all experiments was set to 16 and the number of epochs to 30; the Adam optimizer was used to update the CNN weights during backpropagation. A train/test split ratio of 95:5 is used; we justify using a smaller ratio than the standard 80:20 because of the relatively large dataset. With a ratio of 95:5 and a dataset with 100,000 instances, the test set still has enough data points, roughly several thousand, for its results to generalize.
Each column of the target PCA representation was normalized to have a mean of 0 and a standard deviation of 1 to prevent unstable training.
Finite element method data generation. FEM was used to generate training data for the CNN model. Although obtaining the initial training data is compute-intensive, it takes much less time to train the CNN model and even less time to make high-throughput inferences over thousands of new, randomly generated composites. The crack phase field solver was based on the hybrid formulation for the quasi-static fracture of elastic solids and implemented in the commercial FEM software ABAQUS with a user-element subroutine (UEL).
Visualizing PCA. In order to better understand the role PCA plays in effectively capturing the information contained in stress-strain curves, the principal component representation of stress-strain curves is plotted in 3 dimensions. Specifically, we take the first three principal components, which have a cumulative explained variance of ~85%, plot stress-strain curves in that basis, and provide several different angles from which to view the 3D plot. Each point represents a stress-strain curve in the PCA latent space and is colored based on the associated modulus value. The PCA appears to spread out the curves in the latent space based on modulus values, which suggests that this is a useful latent space for the CNN to make predictions in.
CNN model design and performance. Our CNN was a fully convolutional neural network, i.e., the only dense layer was the output layer. All convolution layers used 16 filters with a stride of 1, with a LeakyReLU activation followed by BatchNormalization. The first 3 conv blocks did not have 2D MaxPooling, followed by 9 conv blocks which did have a 2D MaxPooling layer, placed after the BatchNormalization layer.
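The 95:5 split and per-column standardization described above can be sketched in a few lines of plain Python (illustrative stand-ins for the scikit-learn utilities the paper actually uses):

```python
import random

def train_test_split(data, test_ratio=0.05, seed=42):
    """Shuffle rows and split them 95:5, as in the text."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

def standardize_columns(rows):
    """Scale each column to zero mean and unit standard deviation."""
    stats = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5 or 1.0
        stats.append((mean, std))
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

data = [[float(i), float(i % 3)] for i in range(100)]  # toy 2-column table
train, test = train_test_split(data)
print(len(train), len(test))
scaled = standardize_columns(train)
```

In practice the column statistics must be computed on the training set only and then reused to transform the test set, so the test data never leaks into the normalization.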
A GlobalAveragePooling layer was used to reduce the dimensionality of the output tensor from the sequential convolution blocks, and the final output layer was a Dense layer with 15 nodes, where each node corresponded to a principal component. In total, our model had 26,319 trainable weights. Our architecture was motivated by the recent development and convergence onto fully convolutional architectures for traditional computer vision applications, where convolutions are empirically observed to be more efficient and stable for learning as opposed to dense layers. In addition, in our previous work, we had shown that CNNs were a capable architecture for learning to predict mechanical properties of 2D composites [30]. The convolution operation is an intuitively good fit for predicting crack propagation because it is a local operation, allowing it to implicitly featurize and learn the local spatial effects of crack propagation. After applying the PCA transformation to reduce the dimensionality of the target variable, the CNN is used to predict the PCA representation of the stress-strain curve of a given binary composite design. After training the CNN on a training set, its ability to generalize to composite designs it has not seen is evaluated by comparing its predictions on an unseen test set. However, a natural question that emerges is how to evaluate a model's performance at predicting stress-strain curves in a real-world engineering context. While simple scalar metrics such as mean squared error (MSE) and mean absolute error (MAE) generalize easily to vector targets, it is not clear how to interpret these aggregate summaries of performance. It is difficult to use such metrics to ask questions such as "Is this model good enough to use in the real world?" and "On average, how poorly will a given prediction be, relative to some given specification?"
Although being able to predict stress-strain curves is an important application of FEM and a highly desirable property for any machine learning model to learn, it does not easily lend itself to interpretation. Specifically, there is no simple quantitative way to define whether two stress-strain curves are "close" or "similar" in real-world units. Given that stress-strain curves are oftentimes intermediary representations of a composite property that are used to derive more meaningful descriptors such as modulus, strength, and toughness, we decided to evaluate the model in an analogous fashion. The CNN prediction in the PCA latent space representation is transformed back to a stress-strain curve using PCA, and used to derive the predicted modulus, strength, and toughness of the composite. The predicted material descriptors are then compared with the actual material descriptors. In this way, MSE and MAE now have clearly interpretable units and meanings. The average performance of the model with respect to the error between the actual and predicted material descriptor values derived from stress-strain curves is presented in Table. The MAE for material descriptors provides an easily interpretable metric of model performance and can easily be used in any design specification to provide confidence estimates for a model prediction. When comparing the mean absolute error (MAE) to the range of values taken on by the distribution of material descriptors, we can see that the MAE is relatively small compared to the range: it is <10% of the range for all material descriptors.
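Deriving the material descriptors from a stress-strain curve, as described above, can be sketched numerically. The definitions used here are illustrative assumptions (modulus as the initial slope, strength as the peak stress, toughness as the trapezoidal area under the curve); the paper does not give its exact formulas.

```python
# Derive (modulus, strength, toughness) from a sampled stress-strain curve.
def descriptors(strain, stress):
    # Modulus: slope of the initial (assumed linear-elastic) segment.
    modulus = (stress[1] - stress[0]) / (strain[1] - strain[0])
    # Strength: peak stress reached along the curve.
    strength = max(stress)
    # Toughness: area under the curve, trapezoidal rule.
    toughness = sum(
        0.5 * (stress[i] + stress[i + 1]) * (strain[i + 1] - strain[i])
        for i in range(len(strain) - 1)
    )
    return modulus, strength, toughness

# A toy piecewise-linear curve: elastic rise, then a plateau.
strain = [0.0, 0.01, 0.02, 0.03]
stress = [0.0, 1.0, 1.5, 1.5]
E, s_max, U = descriptors(strain, stress)
```

Running both the predicted and the ground-truth curves through the same routine gives descriptor pairs whose MAE has physical units, which is the evaluation strategy the text argues for.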
Relatively tight confidence intervals on the error indicate that this model architecture is stable, that the model performance is not heavily dependent on initialization, and that our results are robust to different train-test splits of the data.

Future work

Future work includes combining empirical models with optimization algorithms, such as gradient-based methods, to identify composite designs that yield complementary mechanical properties. The ability of a trained empirical model to make high-throughput predictions over designs it has never seen before allows for large parameter space optimization that would be computationally infeasible for FEM. In addition, we plan to explore different visualizations of empirical models in an effort to "open up the black-box" of such models. Applying machine learning to finite-element methods is a rapidly growing field with the potential to discover novel next-generation materials tailored for a variety of applications. We also note that the proposed method can be readily applied to predict other physical properties represented in a similar vectorized format, such as electron/phonon densities of states and sound/light absorption spectra.

Conclusion

In conclusion, we applied PCA and CNN to rapidly and accurately predict the stress-strain curves of composites beyond the elastic limit. In doing so, several novel methodological approaches were developed, including using the material descriptors derived from the stress-strain curves as interpretable metrics for model performance, and applying dimensionality reduction techniques to stress-strain curves. This method has the potential to enable composite design with respect to mechanical response beyond the elastic limit, which was previously computationally infeasible, and can generalize easily to related problems outside of microstructural design for enhancing mechanical properties.

Chinese version: Prediction of Stress-Strain Curves of Composite Microstructures Based on Convolutional Neural Networks. Charles, Jim, Ryan, Grace. Abstract: Stress-strain curves are an important representation of the mechanical properties of materials, from which important properties such as elastic modulus, strength, and toughness can be defined.

Research on Symmetric Prediction of Protein-Protein Interactions Based on Pairwise Kernels (doctoral dissertation)

Doctoral Dissertation: Research on Symmetric Prediction of Protein-Protein Interactions Based on Pairwise Kernels
Yu Jiantao, Harbin Institute of Technology, June 2011
Chinese Library Classification: TP391.2; school code: 10213; UDC: 681.37; security level: public
Candidate: Yu Jiantao
Supervisor: Prof. Guo Maozu
Academic degree applied for: Doctor of Engineering
Specialty: Artificial Intelligence and Information Processing
Affiliation: School of Computer Science and Technology
Date of defence: June 2011
Degree-conferring institution: Harbin Institute of Technology

Abstract: Proteins are the direct executors of life activities, and interactions between proteins are one of the important ways in which proteins realize their functions. Constructing a protein-protein interaction (PPI) network is therefore a prerequisite for understanding molecular biological functions and the laws of cellular life, and is also the key to studying the emergence and development of diseases in organisms and, further, to identifying molecular drug targets.

Silicon Labs Simplicity Studio 5 User Guide

Tech Talks LIVE Schedule – Presentation will begin shortly
Find past recorded sessions at: https:///support/training
Fill out the survey for a chance to win a BG22 Thunderboard!

Topics and dates:
- Building a Proper Mesh Test Environment: How This Was Solved in Boston (Thursday, July 2)
- Come to your Senses with our Magnetic Sensor (Thursday, July 9)
- Exploring Features of the BLE Security Manager (Thursday, July 23)
- New Bluetooth Mesh Light & Sensor Models (Thursday, July 30)
- Simplicity Studio v5 Introduction (Thursday, August 6)
- Long Range Connectivity using Proprietary RF Solution (Thursday, August 13)
- Wake Bluetooth from Deep Sleep using an RF Signal (Thursday, August 20)

Welcome – Silicon Labs LIVE: Wireless Connectivity Tech Talks Summer Series
Introduction to Simplicity Studio 5, August 6th, 2020
https:///products/development-tools/software/simplicity-studio/simplicity-studio-5

What is Simplicity Studio 5?
- A free, Eclipse-based development environment designed to support the Silicon Labs IoT portfolio.
- Provides access to target device-specific web and SDK resources, plus software and hardware configuration tools.
- Integrated Development Environment (IDE): Eclipse-based C/C++ IDE with the GNU ARM toolchain.
- Advanced value-add tools: network analysis, code-correlated energy profiling, configuration tools, etc.

The data driving Simplicity Studio 5: the Gecko SDK (stacks, Gecko Platform, examples, demos, metadata), dev guides, tutorials, API reference manuals, reference manuals, datasheets, errata, and the hardware kit board ID.

Simplicity Studio 5 – Launcher
On the Welcome page you can select a target device, start a new project, and access support resources and educational material. Pressing 'Welcome' on the tool bar returns to the Welcome page at any time.

Launcher Perspective – Overview
1. Welcome & Target Selection: a "get started" section to help with device or board selection.
2. Debug Adapters: shows connected debug adapters, including Silicon Labs kits, Segger, J-Link, etc.
3. My Products: an editable list of products you may wish to use as target devices.
4. Menu: the menu and tool bar provide access to a number of functions and shortcuts.

Launcher Perspective – Device Cards
1. General Information: shows the debugger, debugger mode, and firmware versions for the adapter, security, and SDK.
2. Recommended QSGs: quick links to recommended quick start guides for the selected product.
3. Board: shows which evaluation board is being used and provides easy access to its documentation.
4. Target Part: shows the full part number and provides easy access to its documentation.

Launcher Perspective – Example Projects
1. Technology Filter: a keyword filter box and technology-type check boxes let you dial into the example you are looking for.
2. Resource List: shows the example projects intended for your selected technology and target device.

Launcher Perspective – Documentation
1. Resource Filter: a keyword filter box and resource-type check boxes narrow the resources shown (data sheet, app note, errata, QSG, etc.).
2. Technology Type: technology check boxes narrow your search by technology (Bluetooth, bootloaders, Thread, Zigbee, etc.).
3. Resources: a list of resources that narrows as you select filters.

Launcher Perspective – Demos
1. Demo Filter: narrows your search of demos for the selected device.
2. Demos: a list of pre-compiled demos ready to be programmed into the selected device.

Launcher Perspective – Compatible Tools
A launching pad for tools such as Hardware Configurator, Network Analyzer, Device Console, Energy Profiler, etc.

Launcher Perspective – Pin Configuration Tool
Simplicity Studio 5 offers a Pin Configuration Tool that lets the user easily configure new peripherals or change the properties of existing ones. The graphical view differs based on the chip being used.

Simplicity Studio 5 – IDE
IDE – Overview
1. Tool Bar & Menu: launching pad for tools.
2. Project Explorer: the directory structure and all files associated with the project.
3. Debug Adapters: shows connected debuggers and EVKs.
4. Editor & Configurators: code editing windows and configurators for the project/technologies.
5. Additional Windows: Problems, Search, Call Hierarchy, Console.

IDE – Project Configurator Overview
1. Target & SDK: lets the user change the development target and SDK.
2. Project Details: can change the import mode and force generation.
3. Project Generators: lets the user modify which files are generated by the project.

IDE – Project Configurator Software Components
1. Component: expand components to see categories and sub-categories.
2. Selected Component: view details of a given component; a gear indicates a configurable component, and check marks show installed components.
3. Filters & Keywords: help you search the component categories.

IDE – Configurators (GATT)
1. GATT Configurator: view, add, and remove GATT profiles, services, characteristics, and descriptors.
2. GATT Editor: lets the user view and modify settings in the profiles, services, characteristics, and descriptors within the GATT.

IDE – Editing the GATT
From the GATT Configurator, clicking an editable item such as the device name opens a new window where the user can see the editable content and select or de-select several options for it.

Simplicity Studio 5 – Migration
Developer options: an existing project on GSDK v2.7.x stays on Simplicity Studio 4, with bugfixes provided per the software longevity commitment (https:///products/software/longevity-commitment); final developer binaries can continue on the SS4/GSDK 2.7.x continuance option, or take the SS5/GSDK 3.x upgrade option via the migration toolkit (process and availability vary by technology). New projects use GSDK v3.x.x with Simplicity Studio 5 and Secure Vault. GSDK 2.x to 3.x is a major change.

Simplicity Studio – Project Structure
The project structure changed from GSDK v2.x (Bluetooth SDK v2.x) to GSDK v3.x (Bluetooth SDK v3.x). It is now much easier to see which files can be modified by the generator, and easier to find and identify the configuration files. This matters because with GSDK v3.x many more files are generated by the addition of software components.

Simplicity Studio – BGAPI Commands
BGAPI commands change both their names and their structure to make error checking and the handling of return values simpler. For many commands, renaming means only changing the gecko_cmd_ prefix to sl_bt_. Other functions have been renamed due to changes in functionality or API class, or simply to make the functions more logical. Some API functions have been split into multiple ones, while others have been merged.

Simplicity Studio – Useful Links
Simplicity Studio 5: https:///products/development-tools/software/simplicity-studio/simplicity-studio-5
Simplicity Studio 5 User Guide: https:///simplicity-studio-5-users-guide/latest/index
Quick Start Guide, Bluetooth SDK v3.x: https:///documents/public/quick-start-guides/qsg169-bluetooth-sdk-v3x-quick-start-guide.pdf
Transitioning from Bluetooth SDK v2.x to v3.x: https:///documents/public/application-notes/an1255-transitioning-from-bluetooth-sdk-v2-to-v3.pdf
Bluetooth SDK 3.0.0.2 Release Notes: https:///documents/public/release-notes/bt-software-release-notes-3.0.0.2.pdf

The Largest Smart Home Developer Event, September 9-10, 2020. Immerse yourself in two days of technical training designed especially for engineers, developers and product managers. Learn how to "Work With" ecosystems including Amazon and Google, and join hands-on classes on how to build door locks, sensors, LED bulbs and more. Don't miss out, register today! workswith.silabs.com

Thank you. Questions?

English terminology for clustering algorithms

1. Clustering
2. Distance metric
3. Similarity metric
4. Pearson correlation coefficient
5. Euclidean distance
6. Manhattan distance
7. Chebyshev distance
8. Cosine similarity
9. Hierarchical clustering
10. Divisive clustering
11. Agglomerative clustering
12. K-means clustering
13. Gaussian mixture model clustering
14. Density-based clustering
15. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
16. OPTICS (Ordering Points To Identify the Clustering Structure)
17. Mean shift
18. Clustering evaluation metrics
19. Silhouette coefficient
20. Calinski-Harabasz index
21. Davies-Bouldin index
22. Cluster center
23. Cluster radius
24. Noise point
25. Within-cluster variation
26. Between-cluster variation
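Several of the distance and similarity measures listed above have one-line textbook definitions. A minimal pure-Python sketch, not tied to any particular library:

```python
# Textbook definitions of four measures from the list above.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Two sample points in the plane.
a, b = [0.0, 3.0], [4.0, 0.0]
```

For these two points the Euclidean distance is 5, the Manhattan distance 7, the Chebyshev distance 4, and the cosine similarity 0 (the vectors are orthogonal).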


Keyword-based Correlated Network Computation over Large Social Media

Abstract—Recent years have witnessed an unprecedented proliferation of social media, e.g., millions of blog posts, micro-blog posts, and social networks on the Internet. This kind of social media data can be modeled as a large graph where nodes represent the entities and edges represent relationships between entities of the social media. Discovering keyword-based correlated networks in these large graphs is an important primitive in data analysis, through which users can pay more attention to the information they care about in the large graph. In this paper, we propose and define the problem of keyword-based correlated network computation over a massive graph. To do this, we first present a novel tree data structure that only maintains the shortest path of any two graph nodes, by which the massive graph can be equivalently transformed into a tree data structure for addressing our proposed problem. After that, we design efficient algorithms to build the transformed tree data structure from a graph offline and compute the γ-bounded keyword matched subgraphs based on the pre-built tree data structure on the fly. To further improve the efficiency, we propose weighted shingle-based approximation approaches to measure the correlation among a large number of γ-bounded keyword matched subgraphs. At last, we develop a merge-sort based approach to efficiently generate the correlated networks.
Our extensive experiments demonstrate the efficiency of our algorithms in reducing time and space cost. The experimental results also justify the effectiveness of our method in discovering correlated networks from three real datasets.

Index Terms—Social Media, Correlated Networks, Keyword Query, Large Graph

I. INTRODUCTION

Recent years have seen an astounding growth of networks in a wide spectrum of application domains, ranging from sensor and communication networks to biological and social networks. This becomes especially apparent in the great surge of popularity of Web 2.0 applications such as Facebook, LinkedIn, Twitter and Foursquare. Typically, these networks can be modeled as large graphs with nodes representing entities and edges depicting relationships between entities [1]. To retrieve interesting information from large graphs, users often type in keywords as a request, which is known as keyword search over graph data. The problem of keyword search over graph data has been studied extensively, e.g., [2], [3], [4], [5], [6], [7], [8]. Most existing works return top-k or all minimally matched subgraphs to the users.
However, sometimes users not only want to see the individually matched subgraphs, but also expect to see bigger pictures consisting of multiple individual results with high correlations among them. This finds many applications that need to identify a dense part of a large network, such as social media, according to a specific topic defined by a set of keywords. For example, new events happen every day in the world, which leads to lots of discussions in related web pages/blogs or in social networks on the Internet. As another example, companies often launch advertisements before bringing their new products to market. The launched advertisements also receive comments from thousands of micro-blogs or persons on social media. Discovering such correlated networks among these related web pages/blogs or in social networks is helpful for analyzing the influence, consequence and scope of the new events or products related to the given keywords. Therefore, in this paper we define and study the problem of so-called keyword-based correlated network computation over a large graph. To do this, we can apply the dense subgraph discovery metrics to model our problem of keyword-based correlated network computation over a large graph. That is to say, the density of grouping the components into correlated networks can be utilized to evaluate the correlation of the components, where each component is a full keyword matched subgraph and the maximal distance of any two keyword nodes in each component is bounded by a user-specified value γ. Dense subgraph discovery techniques have been studied in [9], [10], [11], [12], [13], [14], [15], [16], [17]. However, all these works concentrate their research on the efficiency of discovering the densest subgraph, or the top-k dense subgraphs, from a graph. There is no existing work that allows users to find their interested dense subgraphs with a search request (e.g., a keyword query).

Fig. 1. An Example of Graph Data G

Jianxin Li1, Chengfei Liu2, Md. Saiful Islam3
Swinburne University of Technology, Melbourne, Australia
{1jianxinli, 2cliu, 3mdsaifulislam}@.au

Example 1: Figure 1 provides a part of a large social network, where the graph nodes may represent persons or micro-blog posts in a social network, and the graph edges may represent friendships of persons or the communication relationships of these micro-blog posts over the social network. In addition, the keywords at the side of the nodes may represent the personal description information or the contents of the shared stories to be published in the micro-blog posts. Assume a user would like to see the correlated networks with the topic related to the set of keywords {k1, k2, k3}. To control the size of each component (subgraph) in the discovered network, the user can specify a parameter γ to bound the maximal distance of any two keyword nodes in each component. As such, our problem is to find a set of correlated networks where each network consists of multiple components (subgraphs) with high correlation above a threshold, and each γ-bounded component of the network must contain the full keywords. While there are lots of studies on searching subgraphs based on keywords, and on efficiently discovering dense subgraphs, no existing work takes both aspects into account in the network analysis scenario. It is a big challenge to discover correlated networks with consideration of the given keywords over a graph, especially over a large graph.
An easy way is to first compute the dense subgraphs using the conventional dense subgraph discovery techniques and then filter the dense subgraphs by checking whether they contain the full keywords. For instance, we can compute the densest subgraph by using a representative approach in [11]. But often the densest subgraph does not contain the full keywords. It has to try the next densest subgraph until finding the right one with the full keywords or probing all the nodes in the graph. At some point, even if we find a dense subgraph that contains the full keywords, we have to check the remaining nodes by repeating the above operation until all possible dense subgraphs with density above a threshold have been completely identified. Figure 2 shows 3 densest subgraphs in the graph shown in Figure 1 obtained by applying the conventional dense subgraph discovery approaches. "or" in Figure 2 means only one of the first two subgraphs can be generated, because a node can only appear in one densest subgraph based on [11]. It is obvious that they are false candidates because they do not contain the full keywords. Therefore, it has to repeatedly try the other subgraphs with lower densities.

Fig. 2. The False Dense Subgraphs for {k1, k2, k3} over G

Sometimes it is not easy to give a suitable density threshold value to select the dense subgraphs in a large graph. This case is discussed in [12], which proposes a top-k dense subgraph discovery approach. However, both methods in [11] and [12] are unsuitable for our proposed problem. This is because (1) lots of unnecessary time is spent on the computation of the false dense subgraph candidates that do not contain the full keywords; (2) the components in each dense subgraph candidate have to be detected; and (3) the computation of the distance between the keyword nodes in each component cannot be avoided. Therefore, although they are efficient for computing general dense subgraphs, they are not suitable for our problem, where we expect to reply to a user's on-line request within a short response
time. In addition, we allow the correlated networks to have overlaps in this paper, which is not allowed in traditional dense subgraph discovery methods. This is mainly because an entity in a social network may take important roles in different sub-networks at the same time, while these sub-networks may not have strong correlation. Another easy way is to incrementally generate a temporary subgraph by scanning the graph nodes one by one and checking the density and the full keywords of the temporary subgraph. However, this method is also infeasible, because even if a temporary subgraph can be identified as a false candidate due to low density, the nodes in the temporary subgraph may still have a chance to be involved in other dense subgraphs. For instance, if we read in v1, v2, v3, and v4, then the density of the subgraph consisting of these four nodes is 66%, where we use the simple density metric (e.g., 2|E| / (|V|(|V|-1))) in [9], [11], [12]. However, the nodes v2, v3, and v4 can construct a denser subgraph (density = 83%) with another node v7. Therefore, the incremental-based method does not work, because the density metrics are not monotonic. To address this challenging problem, in this paper we apply the semantics in the metrics [15], [16] to measure the correlation of the components in a network. The semantics of the metrics originates from the clique definition, in which two nodes belonging to a clique share all nodes in the clique, where a clique subgraph is a fully connected subgraph with density one. Obviously, if two given nodes share many adjacent nodes, then they have a high probability of belonging to a dense subgraph. Based on this observation, we first compute the γ-bounded keyword matched subgraphs as the components over the large graph. Then, we measure the correlation of any two components by checking their overlapped neighbor nodes. Finally, we generate the correlated networks, where each network consists of more than one component and the connection nodes among them. Different from previous graph keyword search
approaches, in this paper we are required to efficiently find the maximal covering keyword matched subgraphs bounded by γ, not top-k minimal subgraphs. Our study can be considered as a complementary work that fills in the gap between graph keyword search and dense subgraph discovery.

The contributions of our work can be summarized as follows:
- We propose and define a new problem of keyword-based correlated network computation based on the extended semantics of clique.
- We design a new data structure, by which the graph data can be equivalently transformed into tree data with regard to computing the γ-bounded maximal covering keyword matched subgraphs.
- We develop a weighted shingling approach to improve the performance of discovering the correlated networks for a set of keywords over a large graph.
- We evaluate our methods on a variety of graph data, and the experimental results demonstrate the effectiveness of our correlated network model and the efficiency of the evaluation algorithms.

The rest of this paper is organized as follows. We define our problem of keyword-driven correlated network computation in Section II. Section III provides our solution overview. In Section IV, we design a new data structure, develop an efficient algorithm for computing the γ-bounded maximal covering subgraphs, and present a weighted shingling algorithm to discover the correlated networks. Extensive experimental evaluations are provided in Section V. At last, we review the related work in Section VI and conclude the paper in Section VII.

II. CORRELATED NETWORK MODEL

Many networks in real applications can be modeled as graphs. Given a graph G(V, E) which consists of the vertex set V and the edge set E, we are interested in identifying the correlated networks of G for a given set of keywords.
Each network candidate consists of a set of correlated individual subgraphs and a set of connection nodes, in which each individual subgraph should be a keyword search result candidate and any two individual subgraphs should have strong correlation. By adjusting the correlation ratio, we can control the density of the generated correlated networks.

Definition 1: (Keyword Matched Nodes) A graph node is a keyword matched node if it directly contains at least one of the given keywords.

Definition 2: (γ-Bounded Keyword Matched Subgraph) A γ-bounded keyword matched subgraph consists of a set of keyword matched nodes, the corresponding connection nodes, and connection edges. It satisfies three conditions: (1) there is at least one matched node for each given keyword in the subgraph; (2) it keeps all the shortest paths of any two keyword matched nodes in the subgraph; (3) the distance of each shortest path of any two keyword matched nodes in the subgraph is no more than the user-specified hop number γ. Each keyword matched subgraph is a network component.
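The distance condition (3) of Definition 2 can be checked mechanically with breadth-first search. The sketch below is illustrative only (the adjacency-dict encoding and node names are not from the paper): it tests whether every pair of keyword matched nodes in a candidate subgraph lies within γ hops.

```python
# Check the gamma-bound of Definition 2 on a candidate subgraph.
from collections import deque

def hop_distances(adj, source):
    """BFS hop distances from source within the (sub)graph adj."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def is_gamma_bounded(adj, keyword_nodes, gamma):
    """True iff every pair of keyword nodes is within gamma hops."""
    for u in keyword_nodes:
        dist = hop_distances(adj, u)
        for v in keyword_nodes:
            if v not in dist or dist[v] > gamma:
                return False
    return True

# Toy subgraph: a path k1 - c - k2 (the keyword nodes are 2 hops apart).
adj = {"k1": ["c"], "c": ["k1", "k2"], "k2": ["c"]}
```

On this toy path the check passes for γ = 2 but fails for γ = 1, matching the intuition that k1 and k2 are exactly two hops apart.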
Example 2: Consider a keyword query {k1, k2, k3} over the graph G in Figure 1. If γ is set to 2, then Figure 3 shows a 2-bounded keyword matched subgraph. In this subgraph, we can easily observe that the minimal distance of any two keyword nodes does not exceed the bound. Comparing the 2-bounded subgraph with its corresponding part in G, readers may question why we count the node v4 but do not count the node v6. This is because counting v4 makes the keyword matched nodes v2 and v9 connected within 2 hops; otherwise their distance would be 3 (i.e., v9-v8-v7-v2),

Fig. 3. A 2-Bounded Subgraph for {k1, k2, k3} and its neighbors over G
which exceeds the bound 2. However, deleting v6 does not affect the minimal distance of the keyword nodes.

Definition 3: (Correlated Network) A network consists of multiple γ-bounded keyword matched subgraphs and their connection nodes, where these subgraphs can be considered as the components of the network. The network is a correlated one if and only if all its components are correlated with each other. The correlation of any two components G1 and G2 can be measured by |G'1 ∩ G'2| / |G'1 ∪ G'2|, where G'1 contains the nodes in G1 and the neighbor nodes of G1, and G'2 consists of the nodes in G2 and the neighbor nodes of G2.

The above definition extends the clique semantics, in which two nodes have a high probability of belonging to a dense subgraph if they have many adjacent nodes. In this work, we can consider each component (γ-bounded keyword matched subgraph) as a "virtual" node, and the nodes in or close to the component as the "neighbor" nodes of the "virtual" node. Therefore, if two components occur in a correlated network, then their corresponding "virtual" nodes should share many neighbor nodes. In this paper, the neighbor nodes of a "virtual" node include not only the nodes in the component, but also the outlinked nodes of the nodes in the component.

Fig. 4. Another Two 2-Bounded Subgraphs for {k1, k2, k3} and their neighbors over G

Example 3: Figure 3 and Figure 4 present three 2-bounded keyword matched subgraphs for a keyword query {k1, k2, k3} over the graph G in Figure 1, where γ is set to 2. Their corresponding neighbor nodes are listed at the right side of these subgraphs. Assume the three subgraphs are denoted as G1, G2 and G3, and their extended subgraphs are denoted as G'1, G'2 and G'3. E.g., G'2 consists of G2 and its neighbors {v3, v4, v8}. To measure the correlations of the three subgraphs, we have |G'1 ∩ G'2| / |G'1 ∪ G'2| = 0.54; |G'1 ∩ G'3| / |G'1 ∪ G'3| = 0.33; |G'2 ∩ G'3| / |G'2 ∪ G'3| = 0.15. Based on the calculation, we can construct the correlated networks by adjusting the correlation ratio. For instance, if the correlation ratio is set to 0.5, we can return a correlated network consisting of G1 and G2, where all the nodes in G'2 (i.e., G2 and its neighbor nodes) are overlapped with G'1. If the ratio is reduced to 0.33, we can obtain a second correlated network consisting of G1 and G3, where v9, v12, v8, v4 and v10 are the overlapped nodes. Of course, if the ratio is further reduced to 0.15, then we are able
to produce a large network that consists of all of G1, G2 and G3. From Example 3, we can see that both the nodes in components and their outlinked nodes may become connection nodes. These nodes should be weighted differently. That is to say, an overlapped node appearing in both components should contribute more to the correlation value of the two components than an overlapped node appearing in only one of the components or outside both components. Therefore, in query evaluation, we can assign 1 as the weight of each node v in G1 because v is an inner node with regard to G1. Based on the diffusion weighted model, each outlinked node v' in G'1 \ G1 can, in regard to G1, be weighted by

weight(v', G1) = 2^(-minDist(v', G1))    (1)

where minDist(v', G1) is the minimal distance of the shortest paths from the outlinked node v' to any node in G1.

Definition 4: (Weighted Correlated Networks) Consider any two γ-bounded keyword matched subgraphs G1 and G2. Assume G'1 and G'2 are their corresponding extended and weighted subgraphs. If G1 and G2 can be grouped together as a correlated network, then the correlation of G'1 and G'2 should satisfy a given threshold value, i.e., Σ{weight(v, G1) * weight(v, G2) | v ∈ G'1 ∩ G'2} / |G'1 ∪ G'2| is no less than the given threshold value.

Following Example 3, we have the calculated correlations based on Definition 4: Σ{weight(v, G1) * weight(v, G2) | v ∈ G'1 ∩ G'2} / |G'1 ∪ G'2| = 0.34; Σ{weight(v, G1) * weight(v, G3) | v ∈ G'1 ∩ G'3} / |G'1 ∪ G'3| = 0.18; Σ{weight(v, G2) * weight(v, G3) | v ∈ G'2 ∩ G'3} / |G'2 ∪ G'3| = 0.03. Due to these more precise correlation values, it is much easier for us to judge the correlations of the three 2-bounded subgraphs. In this paper, we are interested in efficiently discovering all the correlated networks based on our formal definitions.
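Definition 4's weighted correlation can be sketched as follows. The two toy components and their neighbour distances are invented for illustration; only the formula follows the text: weight 1 for inner nodes and 2^(-minDist) for outlinked neighbours.

```python
# Weighted correlation of two components, per Definition 4.
# A component is encoded as (inner_nodes, {outlinked_node: min_hop_distance}).

def weight(v, inner, out_dist):
    """Weight 1 for inner nodes, 2**(-minDist) for outlinked neighbours."""
    if v in inner:
        return 1.0
    return 2.0 ** (-out_dist[v])

def weighted_correlation(c1, c2):
    inner1, out1 = c1
    inner2, out2 = c2
    ext1 = inner1 | set(out1)          # G'1: component plus neighbours
    ext2 = inner2 | set(out2)          # G'2
    shared = ext1 & ext2
    num = sum(weight(v, inner1, out1) * weight(v, inner2, out2)
              for v in shared)
    return num / len(ext1 | ext2)

# Two toy components whose extended node sets overlap on {"b", "c"}.
c1 = ({"a", "b"}, {"c": 1})   # G1 = {a, b}, neighbour c at 1 hop
c2 = ({"b", "c"}, {"d": 1})   # G2 = {b, c}, neighbour d at 1 hop
corr = weighted_correlation(c1, c2)
```

Here the shared nodes are b (inner in both, contributing 1) and c (a 1-hop neighbour of G1, inner in G2, contributing 0.5), over a union of 4 nodes, giving a correlation of 0.375.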
In addition, we will also discuss an efficient solution to generate the γ-bounded keyword matched subgraphs and their corresponding extended and weighted subgraphs with regard to the given keyword query.

III. SOLUTION OVERVIEW

Given a keyword query Q and an integer γ over a large graph G = (V, E), a naive solution is to first compute all the γ-bounded keyword matched subgraphs and their extended subgraphs containing their neighbor nodes. Then, it calculates the correlations of any two subgraphs. At last, it groups the subgraphs and their connection nodes based on the selection criteria, i.e., the given threshold value. However, such a naive solution may be too expensive to be practical due to the following three reasons.
- It is impossible to maintain all the γ-bounded keyword matched subgraphs and their extended subgraphs of a big graph in main memory. It is worth noting that most existing keyword search methods focus on finding top-k subgraphs in a large graph by making use of ranking functions, due to the large number of possible keyword search results. However, in this work we are interested in the correlations among all possible keyword matched subgraphs (the number N of possible keyword matched subgraphs is far greater than k), rather than top-k subgraphs. The naive solution has to maintain all the γ-bounded keyword matched subgraphs and their extended subgraphs of a large graph.
- To generate the correlated networks, we have to measure the correlation of any two keyword matched subgraphs by pair-wise comparisons. However, it is very expensive to do N^2 comparisons and load all the possible keyword matched subgraphs from hard disk (high I/O cost), especially when N is a big number.
- To generate all the γ-bounded keyword matched subgraphs and their extended subgraphs for a large graph, the graph may be passed many times, because a graph node may appear in multiple γ-bounded keyword matched subgraphs and their extended subgraphs.

To address the above challenges, we are required to propose an
efficient and effective approach that can incrementally generate γ-bounded keyword matched subgraphs. For each γ-bounded keyword matched subgraph G1 and its extended subgraph G′1, we only need to record a fixed size of information by taking a sample over G′1, i.e., G′1 can be represented by this fixed-size information. As such, we do not read G1 and G′1 into memory at run time, which saves a great deal of time. Finally, we can determine the correlation between any two γ-bounded keyword matched subgraphs by estimating the correlation between their corresponding fixed-size representations. This idea is similar to the adapted shingling algorithm in [15], in which the authors group graph nodes based on their neighbor nodes by applying the shingling algorithm [16], [18]. Compared with [15], which only focuses on the grouping of general graph nodes, our work has to address more significant challenges:

• Computing γ-bounded keyword matched subgraphs is an NP-hard problem, as discussed below;

• To precisely group graph nodes based on the user's request, we not only consider the nodes of γ-bounded keyword matched subgraphs (like [15]), but also take into account the outlinked nodes of these nodes based on the diffusion weighted model;

• Since the weight of a node represents the relative correlation from the node to its related γ-bounded keyword matched subgraph, the shingling algorithm should be adapted to consider the weights of the nodes being compared.

IV. EVALUATION ALGORITHMS

A. Generating γ-bounded Keyword Matched Subgraphs

To generate γ-bounded keyword matched subgraphs for a set of query keywords and a graph, a straightforward method is to run the breadth-first traversal algorithm from each keyword node up to γ hops. By doing this, we can generate all the γ-bounded keyword matched subgraph candidates, where these candidates may contain duplicates or non-shortest paths.
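The straightforward breadth-first method above can be sketched as follows. This is a minimal illustration rather than the paper's implementation; `graph` (an adjacency-list dict) and `keyword_nodes` (a hypothetical map from each keyword to the nodes containing it) are assumed names.

```python
from collections import deque

def bfs_within_gamma(graph, start, gamma):
    """Collect all nodes reachable from `start` within `gamma` hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == gamma:
            continue  # do not expand beyond gamma hops
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist  # node -> hop distance from start

def candidate_subgraphs(graph, keyword_nodes, gamma):
    """Naively enumerate candidate node sets: from every keyword node,
    gather the other keyword nodes lying within gamma hops. The candidates
    may still rely on non-shortest paths, which is why a refinement step
    is needed afterwards."""
    candidates = set()
    all_kw_nodes = {n for nodes in keyword_nodes.values() for n in nodes}
    for start in all_kw_nodes:
        dist = bfs_within_gamma(graph, start, gamma)
        covered = frozenset(n for n in all_kw_nodes if n in dist)
        if len(covered) > 1:
            candidates.add(covered)  # set() deduplicates identical node sets
    return candidates
```

Note that every keyword node triggers its own traversal of the graph, which is exactly the repeated-scan cost criticized below.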
To refine these candidates, we have to check the path between every two nodes in each candidate and prune the non-shortest paths. We then filter out the duplicate candidates from the final refined candidate set. However, the computational cost of this straightforward method is very high for the following reasons:

• a large number of repeated scans on the graph for each keyword node;
• a large number of unqualified candidates being produced;
• the time spent identifying the shortest path between any two nodes in every candidate.

To reduce this high computational cost, we need to design an efficient approach that overcomes the above shortcomings of the straightforward method. To do this, we devise a new tree data structure to record the shortest path between any two nodes in the graph. As such, generating γ-bounded keyword matched subgraphs for a set of query keywords over a graph can be realised by calculating γ-bounded maximal covering keyword matched subtrees over the equivalently transformed tree data structure. The transformation does not depend on any query, so it can be done offline.

Fig. 5. The Transformed Tree Data Structure of Graph G

By making some node copies, we can use a tree data structure to maintain the graph nodes and their shortest paths. Figure 5 illustrates the transformed tree data structure of the graph data in Figure 1. As shown in Figure 5, the colored nodes represent the copied nodes: the gray nodes are copied by only considering the directly-connected edges of graph nodes, while the red nodes are copied by considering the linked edges of graph nodes up to Γ hops (Γ=3 here), where Γ is set by system administrators, not by the users issuing queries.
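The node-copying transformation described above can be pictured with a simplified sketch covering only 1-Hop Correctness (the full procedure, which handles up to Γ hops, appears in Algorithm 1). Function and variable names here are illustrative, not the paper's:

```python
from collections import deque

def graph_to_tree_1hop(graph, root):
    """Simplified sketch of the transformation for 1-Hop Correctness only:
    build a BFS spanning tree from `root`, then, for every graph edge not in
    the spanning tree, attach a *copy* of the far endpoint under the near
    endpoint, so every directly-connected edge of G also appears in T.
    Copies are just repeated node IDs, which is why the extra space cost
    stays small when node payloads live in a separate file."""
    parent = {root: None}
    children = {root: []}
    order = [root]
    queue = deque([root])
    while queue:  # BFS spanning tree
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in parent:
                parent[v] = u
                children.setdefault(u, []).append(v)
                children.setdefault(v, [])
                order.append(v)
                queue.append(v)
    # attach copied nodes for the non-tree edges (the gray nodes in Figure 5)
    for u in order:
        for v in graph.get(u, []):
            if parent.get(v) != u and parent.get(u) != v:
                children[u].append(("copy", v))
    return children  # tree as a child-list dict; ("copy", v) marks a node copy
```

Extending this sketch to Γ-Hop Correctness would require the additional copying rounds for 2, 3, ..., Γ hops, as the following paragraphs and Algorithm 1 describe.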
Based on the general users' requirements or search preferences, we can pre-build a transformed tree for a graph. Obviously, the larger the value of Γ, the higher the space cost the transformed tree data structure incurs. Although the space cost increases as Γ becomes larger in theory, most of the time only a little extra cost is needed, because the node information is maintained in a separate file and only node IDs are used to represent the node copies. In addition, many of the copied nodes required for Γ have already been implicitly included when we deal with Γ′ (<Γ). For example, we only need to add the three red copied nodes v8, v12 and v11 in Figure 5 in order to satisfy the Γ(=3) requirement. This is because most of the other paths (≤2 hops) have already been pre-built when we consider the directly-connected edges of graph nodes; e.g., as shown in the dashed line area, v7, v9 and v10 have been connected to the node v8, and v8, v10, v11 and v15 have been connected in sequential order.

With the help of the transformed tree data structure, it becomes easy to compute the γ-bounded keyword matched subgraphs. The basic idea is to find a set of subtrees where each subtree has a maximal covering of keyword nodes bounded by γ. Here, maximal covering means including as many keyword nodes as possible while the maximal distance between any two keyword nodes remains bounded by γ. To efficiently find the set of subtrees, we can read the tree nodes in a top-down manner and incrementally generate the subtree candidates, where each candidate is represented by its node set. Although this may still produce a few duplicate candidates due to the existence of copied nodes, the number of duplicate candidates is much smaller than that of the straightforward method.

Fig. 6. 1st and 2nd steps of processing the tree nodes in the Top-Down way

Example 4: Consider a keyword query {k1,k2,k3} (γ=3) over the transformed tree shown in Figure 5. Firstly, we deal with the nodes at the root (v4) and at the 1st level, i.e., v2, v3, v5, v7, v6, v9, v10. Since v2, v7, v9 together contain the full set of keywords, {v4,v2,v7,v9} is a candidate but it is
not a maximal covering candidate. We also know that the maximal distance of {v4,v2,v7,v9} is 2, which allows us to probe the nodes at the 2nd level. For each branch of nodes (in the rectangles), we first generate its corresponding subtree, which does not count the non-keyword leaf nodes, and then produce the corresponding candidate node set by filtering out the repeated

Fig. 7. Candidate set of running the 1st and 2nd steps in the Top-Down way

nodes. For instance, the candidate sets can be listed in the same order as shown in Figure 7: R1={v4,v2,v7,v9,v1}, R2={v4,v2,v7,v9}, R3={v4,v2,v7,v9,v8}, R4={v4,v2,v7,v9,v6,v8}, R5={v4,v2,v7,v9,v8,v12}, R6={v4,v2,v7,v9,v10,v8}. Since R2 and R3 are subsets of R1 and R4 respectively, neither R2 nor R3 can become a result candidate.

At the next step, we take the child nodes of v4 (marked by arrows in Figure 7) as the new roots and check their corresponding subtrees. Since the subtrees rooted at v2, v7, v9 and v10 contain remaining nodes (marked by the triangle), we need to check these subtrees R1, R3, R5 and R6. Let us take R5 as an example. Since we need to probe the nodes at the new level in the subtree rooted at v9, the maximal distance between the nodes in the subtree and its root is 2. As such, we can discard v2 and v7 because they are outside of 3 hops. Subsequently, this leads to the deletion of v4, because v4 becomes a non-keyword leaf node. By probing the remaining nodes of R5 level by level, the candidate R5′={v9,v8,v12,v13} will be generated. Similarly, we can get another three candidates: R1′={v1,v2,v7,v8} from the branch of R1, R3′={v2,v7,v8,v9} (a subset of R4) from the branch of R3, and R6′={v10,v8,v9,v11,v12,v13} (a superset of R5′) from the branch of R6. By comparing these four newly generated candidates and the previous candidates together, we can get the final result candidates, i.e., R1, R1′, R4, R5, R6 and R6′.

Now, we explain the detailed procedure of transforming graph G into the new tree structure T. The key idea is to record the shortest paths of graph nodes up to the given maximal number Γ of hops in the tree T. Firstly, we copy the
directly-connected edges from G to T, which guarantees the 1-Hop Correctness. We then compare the connected edges within 2 hops between G and T. If some edges do not appear in T, we need to copy them into T such that all the edges that are connected within 2 hops in G also appear in T, which guarantees the 2-Hop Correctness. Similarly, we can check the connected edges up to Γ hops, which guarantees the Γ-Hop Correctness. As such, we have the following property.

Property 1: Given a graph G, its pre-built tree T can guarantee to correctly answer any keyword query with the given hop number γ≤Γ, where γ is a user-specified hop number while Γ is the maximal hop number bound set by system administrators.

The detailed procedure is provided in Algorithm 1.

Algorithm 1 Graph2Tree(G=(V,E), an integer Γ)
1: Take any node v∈V with high degree from the graph G;
2: Take the node v as the root of a new tree T;
3: Scan G from the node v in breadth-first traversal;
4: for each node v∈V do
5:   Insert all its directly-connected edges ∈E into T (1-Hop Correctness);
6: hopNum=2;
7: while hopNum≤Γ do
8:   for each node v∈V do
9:     Get the sets of v's connected edges within hopNum hops in G;
10:    for each set of v's connected edges do
11:      if they appear in T and they are connected then
12:        {do nothing};
13:      else
14:        Copy and insert the necessary edges into T to guarantee that the set of edges appear and are connected in T;
15:  hopNum++;
16: Write T into the file system (Γ-Hop Correctness);

From the above discussion in Example 4 and the implementation steps in Algorithm 1, we can see the benefits of utilizing our tree data structure to improve the performance of searching γ-bounded keyword matched subgraphs over graph data.

• It can avoid the online identification of shortest paths required by the straightforward method, because the shortest paths between nodes have been identified and pre-built in the new data structure. Therefore, the paths appearing in any candidate produced from the new data structure are guaranteed to be the shortest.

• It can reduce the unnecessary scanning cost of the straightforward
method, because we can produce all the candidates by scanning the new data structure just once.

• It can incrementally generate all the candidates, unlike the straightforward method, which has to compute every candidate from scratch every time (starting from every keyword node and probing the connected nodes up to γ hops).

• The space cost of our transformed tree data structure can be controlled by varying the hop bound Γ.

Next, we show the brief procedure of computing the γ-bounded keyword matched subgraphs according to the pre-built tree data structure T. The key idea is to efficiently find all the subtrees in which the maximal distance between any two keyword nodes is bounded by γ. To do this, we first traverse the pre-built tree data level by level in a top-down strategy, as shown in Example 4. For each node v, we probe all the possibilities that may generate the maximal covering subtrees by analyzing the relationships among the nodes at the current level, the other nodes at the following levels, and γ. After that, we check its child nodes until all nodes are reached.
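The top-down procedure above accepts a node set only when its keyword nodes are pairwise within γ hops and no further keyword node can be added. A hedged sketch of those two checks follows; `dist` is a hypothetical table of pairwise shortest distances, keyed by ordered node pairs (min, max), as would be supplied by the pre-built tree:

```python
from itertools import combinations

def is_gamma_bounded(kw_nodes_in_cand, dist, gamma):
    """True if every pair of keyword nodes in the candidate lies within
    gamma hops of each other (the gamma-bounded condition)."""
    return all(dist[(u, v)] <= gamma
               for u, v in combinations(sorted(kw_nodes_in_cand), 2))

def is_maximal_covering(kw_nodes_in_cand, all_kw_nodes, dist, gamma):
    """A candidate is a maximal covering if no further keyword node can be
    added without breaking the gamma bound (cf. R2 being absorbed by R1
    in Example 4)."""
    for w in all_kw_nodes - set(kw_nodes_in_cand):
        if all(dist[(min(w, u), max(w, u))] <= gamma
               for u in kw_nodes_in_cand):
            return False  # w could still be added, so not maximal
    return True
```

In the actual top-down traversal these checks are applied incrementally as each tree level is probed, rather than over complete candidate sets as in this simplified sketch.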
