Semantic annotation, indexing, and retrieval


English Terminology of Information Science


Information Science: An Introduction

Information science, often referred to as informatics, is an interdisciplinary field that deals with the study of information: its structure, representation, management, and utilization. At its core, the field explores the principles and practices behind information processing, retrieval, and dissemination. Information scientists are concerned with understanding how information behaves in different contexts, such as libraries, archives, businesses, and computer systems.

Key Components of Information Science

1. Information Retrieval: The process of finding, accessing, and delivering relevant information to users based on their needs and queries. This involves the design of efficient retrieval systems and algorithms to help users navigate through vast repositories of data.
2. Information Organization: The art and science of classifying, indexing, and cataloguing information to make it easier to find and retrieve. This involves creating structures and systems that help users understand the relationships between different types of information.
3. Information Storage and Management: The process of preserving, storing, and managing information over time. This includes considerations for data integrity, security, and access control.
4. Information Systems: The design and development of computer-based systems that support information processing, retrieval, and decision-making. These systems range from simple databases to complex information architectures.
5. Information Ethics and Privacy: The study of ethical principles and practices related to the collection, use, and dissemination of information. This includes considerations for intellectual property rights, privacy protection, and ethical guidelines for information professionals.

Applications of Information Science

Information science finds applications in various fields, including:
- Libraries and Archives: Libraries and archives rely on information science principles to organize, catalogue, and preserve collections of books, documents, and other materials.
- Business and Management: Information systems and data analytics are crucial in businesses for decision-making, market research, and operational efficiency.
- Computer Science: Information science contributes to the design and development of efficient algorithms and data structures for information retrieval and management.
- Education: Educators use information science techniques to create effective learning materials and teaching strategies that enhance student learning.
- Healthcare: Information science plays a vital role in healthcare by supporting patient care, medical research, and the management of health records.

Future Trends in Information Science

As technology continues to evolve, so do the challenges and opportunities in information science. Here are some key trends shaping the future of the field:
- Big Data and Analytics: The exponential growth of data has led to a focus on developing efficient algorithms and tools for data analysis and visualization.
- Artificial Intelligence and Machine Learning: These technologies are revolutionizing information retrieval and recommendation systems, making them more intelligent and adaptive to user needs.
- Information Security and Privacy: The increasing importance of protecting sensitive information has led to a focus on developing robust security measures and privacy-enhancing technologies.
- Semantic Web and Linked Data: These concepts aim to make web-based information more machine-readable and interconnected, enabling more intelligent and semantic-based search and retrieval.
- User-Centered Design: Information systems are increasingly being designed with a focus on user experience and accessibility, ensuring that they are easy to use and meet the needs of diverse user groups.

In conclusion, information science is a dynamic and evolving field that plays a crucial role in our digital world. It continues to adapt and grow as new technologies and challenges emerge, shaping the way we access, manage, and utilize information in various contexts.

A Compilation of Chinese-English Terminology in Artificial Intelligence


Term explanations: English terms with Chinese equivalents (duplicate entries merged, spelling corrected)

A: abductive reasoning 溯因推理; action recognition 行为识别; active learning 主动学习; adaptive systems 自适应系统; adverse drug reactions 药物不良反应; algorithm 算法; algorithm design and analysis 算法设计与分析; artificial intelligence 人工智能; association rule 关联规则; attribute value taxonomy 属性分类规范; autonomous agent 自动代理; autonomous systems 自动系统

B: background knowledge 背景知识; Bayesian inference 贝叶斯推断; Bayesian methods 贝叶斯方法; belief propagation 置信传播; better understanding 内涵理解; big data 大数据; biological network 生物网络; biological sciences 生物科学; biomedical domain 生物医学领域; biomedical research 生物医学研究; biomedical text 生物医学文本; Boltzmann machine 玻尔兹曼机; bootstrapping method 拔靴法

C: case-based reasoning 实例推理; causal models 因果模型; citation matching 引文匹配; classification 分类; classification algorithms 分类算法; cloud computing 云计算; cluster-based retrieval 聚类检索; clustering 聚类; clustering algorithms 聚类算法; cognitive science 认知科学; collaborative filtering 协同过滤; collaborative ontology development 联合本体开发; collaborative ontology engineering 联合本体工程; commonsense knowledge 常识; communication networks 通讯网络; community detection 社区发现; complex data 复杂数据; complex dynamical networks 复杂动态网络; complex network 复杂网络; computational biology 计算生物学; computational complexity 计算复杂性; computational intelligence 智能计算; computational modeling 计算模型; computer animation 计算机动画; computer networks 计算机网络; computer science 计算机科学; concept clustering 概念聚类; concept formation 概念形成; concept learning 概念学习; concept map 概念图; concept model / conceptual model 概念模型; conditional random field 条件随机场模型; conjunctive queries 合取查询; constrained least squares 约束最小二乘; convex programming 凸规划; convolutional neural networks 卷积神经网络; customer relationship management 客户关系管理

D: data analysis 数据分析; data center 数据中心; data clustering 数据聚类; data compression 数据压缩; data envelopment analysis 数据包络分析; data fusion 数据融合; data generation 数据生成; data handling 数据处理; data hierarchy 数据层次; data integration 数据整合; data integrity 数据完整性; data-intensive computing 数据密集型计算; data management 数据管理; data mining 数据挖掘; data model 数据模型; data partitioning 数据划分; data point 数据点; data privacy 数据隐私; data security 数据安全; data stream 数据流; data structure 数据结构; data visualization 数据可视化; data warehouse 数据仓库; database management 数据库管理; database management systems 数据库管理系统; date interlinking 日期互联; date linking 日期链接; decision analysis 决策分析; decision maker 决策者; decision making 决策; decision models 决策模型; decision rule 决策规则; decision support system 决策支持系统; decision tree 决策树; deep belief network 深度信念网络; deep learning 深度学习; default reasoning 默认推理; density estimation 密度估计; design methodology 设计方法论; dimensionality reduction 降维; directed graph 有向图; disaster management 灾害管理; disastrous event 灾难性事件; dissimilarity 相异性; distributed databases 分布式数据库; distributed query 分布式查询; document clustering 文档聚类; domain experts 领域专家; domain knowledge 领域知识; domain-specific language 领域专用语言; dynamic databases 动态数据库; dynamic logic 动态逻辑; dynamic network 动态网络; dynamic system 动态系统

E: earth mover's distance EMD距离; education 教育; efficient algorithm 有效算法; electronic commerce 电子商务; electronic health records 电子健康档案; entity disambiguation 实体消歧; entity recognition 实体识别; entity resolution 实体解析; event detection 事件检测; event extraction 事件抽取; event identification 事件识别; exhaustive indexing 完整索引; expert system 专家系统; explanation-based learning 解释学习

F: factor graph 因子图; feature extraction 特征提取; feature selection 特征选择; feature space 特征空间; first-order logic 一阶逻辑; formal logic 形式逻辑; formal meaning representation 形式意义表示; formal semantics 形式语义; formal specification 形式描述; frame-based system 基于框架的系统; frequent itemsets 频繁项目集; frequent pattern 频繁模式; fuzzy clustering 模糊聚类; fuzzy data mining 模糊数据挖掘; fuzzy logic 模糊逻辑; fuzzy set 模糊集; fuzzy set theory 模糊集合论; fuzzy systems 模糊系统

G: Gaussian processes 高斯过程; gene expression 基因表达; gene expression data 基因表达数据; generative model 生成模型; genetic algorithm 遗传算法; genome-wide association study 全基因组关联分析; graph classification 图分类; graph clustering 图聚类; graph data 图数据; graph database 图数据库; graph mining 图挖掘; graph partitioning 图划分; graph query 图查询; graph structure 图结构; graph theory 图论; graph visualization 图形可视化; graphical user interface 图形用户界面

H: health care 卫生保健; heterogeneous data 异构数据; heterogeneous data source 异构数据源; heterogeneous database 异构数据库; heterogeneous information network 异构信息网络; heterogeneous network 异构网络; heterogeneous ontology 异构本体; heuristic rule 启发式规则; hidden Markov model 隐马尔可夫模型; hierarchical clustering 层次聚类; homogeneous network 同构网络; human-centered computing 人机交互技术; human-computer interaction 人机交互; human interaction 人机交互; human-robot interaction 人机交互

I: image classification 图像分类; image clustering 图像聚类; image mining 图像挖掘; image reconstruction 图像重建; image retrieval 图像检索; image segmentation 图像分割; inconsistent ontology 本体不一致; incremental learning 增量学习; inductive learning 归纳学习; inference mechanisms 推理机制; inference rule 推理规则; information cascades 信息追随; information diffusion 信息扩散; information extraction 信息提取; information filtering 信息过滤; information integration 信息集成; information network 信息网络; information network analysis 信息网络分析; information network mining 信息网络挖掘; information processing 信息处理; information resource management 信息资源管理; information retrieval 信息检索; information retrieval models 信息检索模型; information science 情报科学; information sources 信息源; information system 信息系统; information technology 信息技术; information visualization 信息可视化; instance matching 实例匹配; intelligent assistant 智能辅助; intelligent systems 智能系统; interaction network 交互网络; interactive visualization 交互式可视化

K: kernel function 核函数; kernel operator 核算子; keyword search 关键字检索; knowledge acquisition 知识获取; knowledge base 知识库; knowledge-based system 知识系统; knowledge building 知识建构; knowledge capture 知识获取; knowledge construction 知识建构; knowledge discovery 知识发现; knowledge extraction 知识提取; knowledge fusion 知识融合; knowledge integration 知识集成; knowledge management 知识管理; knowledge management systems 知识管理系统; knowledge model 知识模型; knowledge reasoning 知识推理; knowledge representation 知识表达; knowledge reuse 知识再利用; knowledge sharing 知识共享; knowledge storage 知识存储; knowledge technology 知识技术; knowledge verification 知识验证

L: language model 语言模型; language modeling approach 语言模型方法; large graph 大图; life science 生命科学; linear programming 线性规划; link analysis 链接分析; link prediction 链接预测; linked data 关联数据; location-based services 基于位置的服务; logic programming 逻辑编程; logical implication 逻辑蕴涵; logistic regression logistic回归

M: machine learning 机器学习; machine translation 机器翻译; management system 管理系统; manifold learning 流形学习; Markov chains 马尔可夫链; Markov processes 马尔可夫过程; matching function 匹配函数; matrix decomposition 矩阵分解; maximum likelihood estimation 最大似然估计; medical research 医学研究; mixture of Gaussians 混合高斯模型; mobile computing 移动计算; multi-agent systems 多智能体系统; multimedia 多媒体

N: natural language processing 自然语言处理; nearest neighbor 近邻; network analysis 网络分析; network formation 组网; network structure 网络结构; network theory 网络理论; network topology 网络拓扑; network visualization 网络可视化; neural network 神经网络; nonlinear dynamics 非线性动力学; nonmonotonic reasoning 非单调推理; nonnegative matrix factorization 非负矩阵分解

O: object detection 目标检测; object-oriented 面向对象; object recognition 目标识别; online community 网络社区; online social network 在线社交网络; ontology 本体论; ontology alignment 本体映射; ontology development 本体开发; ontology engineering 本体工程; ontology evolution 本体演化; ontology extraction 本体抽取; ontology interoperability 互用性本体; ontology language 本体语言; ontology mapping 本体映射; ontology matching 本体匹配; ontology versioning 本体版本; open government data 政府公开数据; opinion analysis 舆情分析; opinion mining 意见挖掘; outlier detection 孤立点检测

P: parallel processing 并行处理; patient care 病人医疗护理; pattern classification 模式分类; pattern matching 模式匹配; pattern mining 模式挖掘; pattern recognition 模式识别; personal data 个人数据; prediction algorithms 预测算法; predictive model 预测模型; privacy preservation 隐私保护; probabilistic logic 概率逻辑; probabilistic model 概率模型; probability distribution 概率分布; project management 项目管理; pruning technique 修剪技术

Q: quality management 质量管理; query expansion 查询扩展; query language 查询语言; query processing 查询处理; query rewrite 查询重写; question answering system 问答系统

R: random forest 随机森林; random graph 随机图; random processes 随机过程; random walk 随机游走; range query 范围查询; RDF database 资源描述框架数据库; RDF query 资源描述框架查询; RDF repository 资源描述框架存储库; RDF storage 资源描述框架存储; real time 实时; recommender system 推荐系统; record linkage 记录链接; recurrent neural network 递归神经网络; regression 回归; reinforcement learning 强化学习; relation extraction 关系抽取; relational database 关系数据库; relational learning 关系学习; relevance feedback 相关反馈; resource description framework 资源描述框架; restricted Boltzmann machines 受限玻尔兹曼机; retrieval models 检索模型; rough set 粗糙集; rough set theory 粗糙集理论; rule-based 基于规则; rule-based system 基于规则系统; rule induction 规则归纳; rule learning 规则学习

S: schema mapping 模式映射; schema matching 模式匹配; scientific domain 科学域; search problems 搜索问题; semantic (web) technology 语义技术; semantic analysis 语义分析; semantic annotation 语义标注; semantic computing 语义计算; semantic integration 语义集成; semantic interpretation 语义解释; semantic model 语义模型; semantic network 语义网络; semantic relatedness 语义相关性; semantic relation learning 语义关系学习; semantic search 语义检索; semantic similarity 语义相似度; semantic web 语义网; semantic web rule language 语义网规则语言; semantic workflow 语义工作流; semi-supervised learning 半监督学习; sensor data 传感器数据; sensor networks 传感器网络; sentiment analysis 情感分析; sequential pattern 序列模式; service-oriented architecture 面向服务的体系结构; shortest path 最短路径; similar kernel function 相似核函数; similarity 相似性; similarity measure 相似性度量; similarity relationship 相似关系; similarity search 相似搜索; situation awareness 情境感知; social behavior 社交行为; social influence 社会影响; social interaction 社交互动; social learning 社会学习; social life networks 社交生活网络; social machine 社交机器; social media 社交媒体; social network 社交网络; social network analysis 社会网络分析; social networks 社会网络; social science 社会科学; social tagging 社交标签; social tagging system 社交标签系统; social web 社交网页; sparse coding 稀疏编码; sparse matrices 稀疏矩阵; sparse representation 稀疏表示; spatial database 空间数据库; spatial reasoning 空间推理; statistical analysis 统计分析; statistical model 统计模型; string matching 串匹配; structural risk minimization 结构风险最小化; structured data 结构化数据; subgraph matching 子图匹配; subspace clustering 子空间聚类; supervised learning 有监督学习; support vector machine 支持向量机; system dynamics 系统动力学

T: tag recommendation 标签推荐; taxonomy induction 感应规范; temporal logic 时态逻辑; temporal reasoning 时序推理; text analysis 文本分析; text classification 文本分类; text data 文本数据; text mining 文本挖掘; text mining technique 文本挖掘技术; text summarization 文本摘要; thesaurus alignment 同义对齐; time-frequency analysis 时频分析; time series 时间序列; time series analysis 时间序列分析; time series data 时间序列数据; topic model 主题模型; topic modeling 主题模型; transfer learning 迁移学习; triple store 三元组存储

U: uncertainty reasoning 不精确推理; undirected graph 无向图; unified modeling language 统一建模语言; unsupervised learning 无监督学习; upper bound 上界; user behavior 用户行为; user-generated content 用户生成内容; utility mining 效用挖掘

V: visual analytics 可视化分析; visual content 视觉内容; visual representation 视觉表征; visualization 可视化; visualization technique 可视化技术; visualization tool 可视化工具

W: web 2.0 网络2.0; web forum web论坛; web mining 网络挖掘; web of data 数据网; web ontology language 网络本体语言; web pages web页面; web resource 网络资源; web science 万维科学; web search 网络检索; web usage mining web使用挖掘; wireless networks 无线网络; world knowledge 世界知识; world wide web 万维网

X: XML database 可扩展标记语言数据库

Appendix 2: Data Mining knowledge graph (15 second-level nodes and 93 third-level nodes in total), organized as domain / second-level category / third-level category.

Mining, Indexing, and Querying Historical Spatiotemporal Data


Nikos Mamoulis, University of Hong Kong; Marios Hadjieleftheriou, University of California, Riverside; Huiping Cao, University of Hong Kong; Yufei Tao, City University of Hong Kong; George Kollios, Boston University; David W. Cheung, University of Hong Kong

ABSTRACT

In many applications that track and analyze spatiotemporal data, movements obey periodic patterns; the objects follow the same routes (approximately) over regular time intervals. For example, people wake up at the same time and follow more or less the same route to their work every day. The discovery of hidden periodic patterns in spatiotemporal data, apart from unveiling important information to the data analyst, can facilitate data management substantially. Based on this observation, we propose a framework that analyzes, manages, and queries object movements that follow such patterns. We define the spatiotemporal periodic pattern mining problem and propose an effective and fast mining algorithm for retrieving maximal periodic patterns. We also devise a novel, specialized index structure that can benefit from the discovered patterns to support more efficient execution of spatiotemporal queries. We evaluate our methods experimentally using datasets with object trajectories that exhibit periodicity.

Categories & Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining
Keywords: Spatiotemporal data, Trajectories, Pattern mining, Indexing

(This work was supported by grant HKU 7149/03E from Hong Kong RGC and partially supported by NSF grants IIS-0308213 and Career Award IIS-0133825. KDD'04, August 22-25, 2004, Seattle, Washington, USA. Copyright 2004 ACM 1-58113-888-1/04/0008.)

1. INTRODUCTION

The efficient management of spatiotemporal data has gained much interest during the past few years [10, 13, 4, 12], mainly due to the rapid advancements in telecommunications (e.g., GPS, cellular networks, etc.), which facilitate the collection of large datasets of such information. Management and analysis of moving object trajectories is challenging due to the vast amount of collected data and novel types of spatiotemporal queries.

In many applications, the movements obey periodic patterns; i.e., the objects follow the same routes (approximately) over regular time intervals. Objects that follow approximate periodic patterns include transportation vehicles (buses, boats, airplanes, trains, etc.), animal movements, mobile phone users, etc. For example, Bob wakes up at the same time and then follows, more or less, the same route to his work every day.
Based on this observation, which has been overlooked in past research, we propose a framework for mining, indexing and querying periodic spatiotemporal data.

The problem of discovering periodic patterns from historical object movements is very challenging. Usually, the patterns are not explicitly specified, but have to be mined from the data. The patterns can be thought of as (possibly non-contiguous) sequences of object locations that reappear in the movement history periodically. Moreover, since we do not expect an object to visit exactly the same locations at every time instant of each period, the patterns are not rigid, but differ slightly from one occurrence to the next. The pattern occurrences may also be shifted in time (e.g., due to traffic delays or Bob waking up late again). The approximate nature of patterns in the spatiotemporal domain increases the complexity of mining tasks. We need to discover, along with the patterns, a flexible description of how they variate in space and time. Previous approaches have studied the extraction of patterns from long event sequences [5, 7]. We identify the difference between the two problems and propose novel techniques for mining periodic patterns from a large historical collection of object movements.

In addition, we design a novel indexing scheme that exploits periodic pattern information to organize historical spatiotemporal data, such that spatiotemporal queries are efficiently processed. Since the patterns are accurate approximations of object trajectories, they can be managed in a lightweight index structure, which can be used for pruning large parts of the search space without having to access the actual data from storage. This index is optimized for providing fast answers to range queries with temporal predicates. Effective indexing is not the only application of the mined patterns; since they are compact summaries of the actual trajectories, we can use them to compress and replace historical data to save space. Finally, periodic patterns can predict future movements of objects that follow them.

The rest of the paper is organized as follows. Section 2 presents related work. In Section 3, we give a concrete formulation of periodic patterns in object trajectories and propose effective mining techniques. Section 4 presents the indexing scheme that exploits spatiotemporal patterns. We present a concise experimental evaluation of our techniques in Section 5. Finally, Section 6 concludes with a discussion about future work.

2. RELATED WORK

Our work is related to two research problems. The first is data mining in spatiotemporal and time-series databases.
The second is management of spatiotemporal data. Previous work on spatiotemporal data mining focuses on two types of patterns: (i) frequent movements of objects over time and (ii) evolution of natural phenomena, such as forest coverage. [14] studies the discovery of frequent patterns related to changes of natural phenomena (e.g., temperature changes) in spatial regions. In general, there is limited work on spatiotemporal data mining, which has been treated as a generalization of pattern mining in time-series data (e.g., see [14, 9]). The locations of objects or the changes of natural phenomena over time are mapped to sequences of values. For instance, we can divide the map into spatial regions and replace the location of the object at each timestamp by the region-id where it is located. Similarly, we can model the change of temperature in a spatial region as a sequence of temperature values. Continuous domains of the resulting time-series data are discretized prior to mining. In the case of multiple moving objects (or time-series), trajectories are typically concatenated to a single long sequence. Then, an algorithm that discovers frequent subsequences in a long sequence (e.g., [16]) is applied.

Periodicity has only been studied in the context of time-series databases. [6] addresses the following problem: given a long sequence S and a period T, the aim is to discover the most representative trend that repeats itself in S every T timestamps. Exact search might be slow; thus, [6] proposes an approximate search technique based on sketches. However, the discovered trend for a given T is only one and spans the whole periodic interval. In [8], the problem of finding association rules that repeat themselves in every period of a data sequence is addressed. The discovery of multiple partial periodic patterns that do not appear in every periodic segment was first studied in [5]. A version of the well-known Apriori algorithm [1] was adapted for the problem of finding patterns of the form *AB**C, where A, B, and C are specific symbols (e.g., event types) and * could be any symbol (T = 6, in this example). This pattern may not repeat itself in every period, but it must appear at least min_sup times, where min_sup is a user-defined parameter. In [5], a faster mining method for this problem was also proposed, which uses a tree structure to count the support of multiple patterns in two database scans. [7] studies the problem of finding sets of events that appear together periodically. In each qualifying period, the set of events may not appear in exactly the same positions, but their occurrence may be shifted or disrupted, due to the presence of noise. However, this work does not consider the order of events in such patterns. On the other hand, it addresses the problem of mining patterns and their periods automatically. Finally, [15] studies the problem of finding patterns which appear in at least a minimum number of consecutive periodic intervals, where groups of such intervals are allowed to be separated by at most a time interval threshold.

A number of spatial access methods, which are variants of the R-tree [3], have been developed for the management of moving object trajectories. [10] proposes 3D variants of this access method, suitable for indexing historical spatiotemporal data.
Time is modeled as a third dimension and each moving object trajectory is mapped to a polyline in this 3D space. The polyline is then decomposed into a sequence of 3D line segments, tagged with the object-id they correspond to. The segments, in turn, are indexed by variants of the 3D R-tree, which differ in the criteria they use to split their nodes. Although this generic method is always applicable, it stores redundant information if the positions of the objects do not constantly change. Other works [13, 4] propose multi-version variants of the R-tree, which share similar concepts to access methods for time-evolving data [11]. Recently [12], there is an increasing interest in (approximate) aggregate queries on spatiotemporal data, e.g., "find the distinct number of objects that were in region r during a specific time interval".

3. PERIODIC PATTERNS IN OBJECT TRAJECTORIES

In our model, we assume that the locations of objects are sampled over a long history. In other words, the movement of an object is tracked as an n-length sequence S of spatial locations, one for each timestamp in the history, of the form {(l_0, t_0), (l_1, t_1), ..., (l_{n-1}, t_{n-1})}, where l_i is the object's location at time t_i. If the difference between consecutive timestamps is fixed (locations are sampled every regular time interval), we can represent the movement by a simple sequence of locations l_i (i.e., by dropping the timestamps t_i, since they can be implied). Each location l_i is expressed in terms of spatial coordinates. Figure 1a, for example, illustrates the movement of an object in three consecutive days (assuming that it is tracked only during specific hours, e.g., working hours). We can model it with sequence S = {⟨4, 9⟩, ⟨3.5, 8⟩, ..., ⟨6.5, 3.9⟩, ⟨4.1, 9⟩, ...}. Given such a sequence, a minimum support min_sup, and an integer T, called period, our problem is to discover movement patterns that repeat themselves every T timestamps. A discovered pattern P is a T-length sequence of the form r_0 r_1 ... r_{T-1}, where r_i is a spatial region or the special character *, indicating the whole spatial universe. For instance, pattern AB*C** implies that at the beginning of the cycle the object is in region A, at the next timestamp it is found in region B, then it moves irregularly (it can be anywhere), then it goes to region C, and after that it can go anywhere, until the beginning of the next cycle, when it can be found again in region A. The patterns are required to be followed by the object in at least min_sup periodic intervals in S.

Existing algorithms for mining periodic patterns (e.g., [5]) operate on event sequences and discover patterns of the above form. However, in this case, the elements r_i of a pattern are events (or sets of events). As a result, we cannot directly apply these techniques to our problem, unless we treat the exact locations l_i as discrete categorical values. Nevertheless, it is highly unlikely that an object will repeat an identical sequence of (x, y) locations precisely. Even if the spatial route is precise, the location transmissions at each timestamp are unlikely to be perfectly synchronized. Thus, the object will not reach the same location at the same time every day, and as a result the sampled locations at specific timestamps (e.g., at 9:00 a.m. sharp, every day) will be different. In Figure 1a, for example, the first daily locations of the object are very close to each other; however, they will be treated differently by a straightforward mining algorithm.

[Figure 1: Periodic patterns with respect to predefined spatial regions. (a) An object's movement over three days; (b) a set of predefined regions A-O; (c) event-based patterns: the event sequence AACCCG | AACBDG | AAACHG yields support(AAC**G) = 2, support(AA***G) = 3, support(AA*C*G) = 2.]

One way to handle the noise in object movement is to replace the exact locations of the objects by the regions (e.g., districts, mobile communication cells, or cells of a synthetic grid) which contain them. Figure 1b shows an example of an area's division into such regions. Sequence {A, A, C, C, C, G, A, ...} can now summarize the object's movement, and periodic sequence pattern mining algorithms, like [5], can directly be applied. Figure 1c shows three (closed) discovered patterns for T = 6 and min_sup = 2. A disadvantage of this approach is that the discovered patterns may not be very descriptive if the space division is not very detailed. For example, regions A and C are too large to capture in detail the first three positions of the object in each periodic instance. On the other hand, with detailed space divisions, the same (approximate) object location may span more than one region. For example, in Figure 1b, observe that the third object positions for the three days are close to each other; however, they fall into different regions (A and C) on different days. Therefore, we are interested in the automated discovery of patterns and their descriptive regions. Before we present methods for this problem, we will first define it formally.

3.1 Problem definition

Let S be a sequence of n spatial locations {l_0, l_1, ..., l_{n-1}}, representing the movement of an object over a long history. Let T (T < n) be an integer called period (e.g., day, week, month). A periodic segment s is defined by a subsequence l_i l_{i+1} ... l_{i+T-1} of S, such that i modulo T = 0. Thus, segments start at positions 0, T, ..., (m-1)·T, and there are exactly m = ⌊n/T⌋ periodic segments in S. (If n is not a multiple of T, then the last n modulo T locations are truncated and the length n of sequence S is reduced accordingly.) Let s^j denote the segment starting at location l_{j·T} of S, for 0 ≤ j < m, and let s^j_i = l_{j·T+i}, for 0 ≤ i < T. A periodic pattern P is defined by a sequence r_0 r_1 ... r_{T-1} of length T, such that r_i is either a spatial region or *. The length of a periodic pattern P is the number of non-* regions in P. A segment s^j is said to comply with P if, for each r_i ∈ P, r_i = * or s^j_i is inside region r_i. The support |P| of a pattern P in S is defined by the number of periodic segments in S that comply with P. We sometimes use the same symbol P to refer to a pattern and the set of segments that comply with it. Let min_sup ≤ m be a positive integer (minimum support). A pattern P is frequent if its support is larger than min_sup.

A problem with the definition above is that it imposes no control over the density of the pattern regions r_i. In other words, if the pattern regions are too relaxed (e.g., each r_i is the whole map), the pattern may always be frequent. Therefore, we impose an additional constraint as follows. Let S_P be the set of segments that comply with a pattern P. Then each region r_i of P is valid if the set of locations R^P_i := {s^j_i | s^j ∈ S_P} forms a dense cluster. To define a dense cluster, we borrow the definitions from [2] and use two parameters, ε and MinPts. A point p in the spatial dataset R^P_i is a core point if the circular range centered at p with radius ε contains at least MinPts points. If a point q is within distance ε from a core point p, it is assigned to the same cluster as p. If q is a core point itself, then all points within distance ε from q are assigned to the same cluster as p and q. If R^P_i forms a single, dense cluster with respect to some values of parameters ε and MinPts, we say that region r_i is valid. If all non-* regions of P are valid, then P is a valid pattern. We are interested in the discovery of valid patterns only. In the following, we use the terms valid region and dense cluster interchangeably; i.e., we will often use the term dense region to refer to a spatial dense cluster and the points in it. Figure 2a shows an example of a valid pattern for ε = 1.5 and MinPts = 4. Each region at positions 1, 2, and 3 forms a single, dense cluster and is therefore a dense region. Notice, however, that it is possible that two valid patterns P and P' of the same length (i) have the same * positions, (ii) every segment that complies with P' complies with P, and (iii) |P'| < |P|. In other words, P implies P'. For example, the pattern of Figure 2a implies the one of Figure 2b (denoted by the three circles). A frequent pattern P' is redundant if it is implied by some other frequent pattern P.

[Figure 2: Redundancy of patterns. (a) A valid pattern; (b) a redundant pattern.]

The mining periodic patterns problem searches for all valid periodic patterns P in S which are frequent and non-redundant with respect to a minimum support min_sup. For simplicity, we will use 'frequent pattern' to refer to a valid, non-redundant frequent pattern.

3.2 Mining periodic patterns

In this section, we present techniques for mining frequent periodic patterns and their associated regions in a long history of object trajectories. We first address the problem of finding frequent 1-patterns (i.e., of length 1). Then, we propose two methods to find longer patterns: a bottom-up, level-wise technique and a faster top-down approach.

3.2.1 Obtaining frequent 1-patterns

Including automatic discovery of regions in the mining task does not allow for the direct application of techniques that find patterns in sequences (e.g., [5]), as discussed. In order to tackle this problem, we propose the following methodology. We divide the sequence S of locations into T spatial datasets, one for each offset of the period T. In other words, locations {l_i, l_{i+T}, ..., l_{i+(m-1)·T}} go to set R_i, for each 0 ≤ i < T. Each location is tagged by the id j ∈ [0, ..., m-1] of the segment that contains it. Figure 3a shows the spatial datasets obtained after decomposing the object trajectory of Figure 1a. We use a different symbol to denote locations that correspond to different periodic offsets and different colors for different segment-ids.

[Figure 3: Locations and regions per periodic offset. (a) T-based decomposition; (b) dense clusters in the R_i's.]

Observe that a dense cluster r in dataset R_i corresponds to a frequent pattern having * at all positions and r at position i. Figure 3b shows examples of five clusters discovered in datasets R_1, R_2, R_3, R_4, and R_6. These correspond to five 1-patterns (i.e., r_11*****, *r_21****, etc.). In order to identify the dense clusters for each R_i, we can apply a density-based clustering algorithm like DBSCAN [2]. Clusters with fewer than min_sup points are discarded, since they are not frequent 1-patterns according to our definition.

Clustering is quite expensive, and it is a frequently used module of the mining algorithms, as we will see later. DBSCAN [2] has quadratic cost in the number of clustered points, unless an index (e.g., R-tree) is available. Since R-trees are not available for every set of arbitrary points to be clustered, we use a hash-based method that divides the 2D space using a regular grid with cell area (ε/√2) × (ε/√2). This grid is used to hash the points into buckets according to the cell that contains them. The rationale for choosing this cell size is that if one cell contains at least MinPts points, we know for sure that it is dense and need not perform any range queries for the objects in it. The remainder of the algorithm merges dense cells that contain points within distance ε (using inexpensive minimum bounding rectangle tests or spatial joins, if required) and applies ε-range queries from objects located in sparse cells to assign them to clusters and potentially merge clusters. Our clustering technique is fast because not only does it avoid R-tree construction, but it also minimizes expensive distance computations. The details of this algorithm are omitted for the sake of readability.

3.2.2 A level-wise, bottom-up approach

Starting from the discovered 1-patterns (i.e., clusters for each R_i), we can apply a variant of the level-wise Apriori-TID algorithm [1] to discover longer ones, as shown in Figure 4. The input of our algorithm is a collection L_1 of frequent 1-patterns, discovered as described in the previous paragraph; for each R_i, 0 ≤ i < T, and each dense region r ∈ R_i, there is a 1-pattern in L_1. Pairs ⟨P_1, P_2⟩ of (k-1)-patterns in L_{k-1}, with their first k-2 non-* regions in the same positions and different (k-1)-th non-* positions, create candidate k-patterns (lines 4-6). For each candidate pattern P_cand, we then perform a segment-id join between P_1 and P_2, and if the number of segments that comply with both patterns is at least min_sup, we run a pattern validation function to check whether the regions of P_cand are still clusters. After the patterns of length k have been discovered, we find the patterns at the next level, until there are no more patterns at the current level, or there are no more levels.

Algorithm STPMine1(L_1, T, min_sup):
  1.  k := 2;
  2.  while (L_{k-1} ≠ ∅ ∧ k < T)
  3.    L_k := ∅;
  4.    for each pair of patterns (P_1, P_2) ∈ L_{k-1}
  5.      such that P_1 and P_2 agree on the first k-2
  6.      and have different (k-1)-th non-* positions
  7.        P_cand := candidate_gen(P_1, P_2);
  8.        if (P_cand ≠ null) then
  9.          P_cand := P_1 ⋈_{P_1.sid = P_2.sid} P_2;  // segment-id join
 10.          if |P_cand| ≥ min_sup then
 11.            validate_pattern(P_cand, L_k, min_sup);
 12.    k := k + 1;
 13.  return P := ∪ L_k, for all 1 ≤ k < T;

Figure 4: Level-wise pattern mining.

In order to facilitate fast and effective candidate generation, we use the MBRs (i.e., minimum bounding rectangles) of the pattern regions. For each common non-* position i, the intersection of the MBRs of the regions for P_1 and P_2 must be non-empty; otherwise a valid superpattern cannot exist. The intersection is adopted as an approximation for the new pattern P_cand at each such position i. During candidate pruning, we check for every (k-1)-subpattern of P_cand whether there is at least one pattern in L_{k-1} which agrees in the non-* positions with the subpattern and whose MBR-intersection with it is non-empty at all those positions. In such a case, we accept P_cand as a candidate pattern. Otherwise, we know that P_cand cannot be a valid pattern, since some of its subpatterns (with common space covered by the non-* regions) are not included in L_{k-1}.

Function validate_pattern takes as input a k-length candidate pattern P_cand and computes a number of actual k-length patterns from it. The rationale is that the points at all non-* positions of P_cand may no longer form a cluster after the join of P_1 and P_2. Thus, for each non-* position of P_cand we re-cluster the points.
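To make the 1-pattern discovery step of Section 3.2.1 concrete, here is a minimal sketch in Python, assuming scikit-learn's DBSCAN in place of the paper's custom hash-based grid clustering; the function and parameter names (trajectory, eps, min_pts, min_sup) are illustrative, not from the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def frequent_1_patterns(trajectory, T, eps, min_pts, min_sup):
    """trajectory: (n, 2) array-like of sampled (x, y) locations.
    Returns {offset i: [(cluster label, set of segment-ids), ...]}."""
    n = (len(trajectory) // T) * T            # truncate an incomplete last segment
    locs = np.asarray(trajectory[:n], dtype=float)
    seg_ids = np.arange(n) // T               # segment-id tag of each location
    patterns = {}
    for i in range(T):                        # offset-based decomposition: R_i
        R_i, sids = locs[i::T], seg_ids[i::T]
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(R_i)
        for c in set(labels) - {-1}:          # -1 marks DBSCAN noise points
            members = set(sids[labels == c])
            if len(members) >= min_sup:       # dense cluster => frequent 1-pattern
                patterns.setdefault(i, []).append((c, members))
    return patterns
```

Each returned cluster at offset i corresponds to a frequent 1-pattern with a dense region at position i and * everywhere else; the attached segment-id sets are what the joins in STPMine1 would operate on.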
If, for some position, the points can be grouped into more than one cluster, we create a new candidate pattern for each cluster and validate it. Note that, from a candidate pattern P_cand, it is possible to eventually generate more than one actual pattern. If no position of P_cand is split into multiple clusters, we may still need to re-cluster the non-* positions of P_cand, since some points (and segment-ids) may be eliminated during clustering at some position.

To illustrate the algorithm, consider the 2-length patterns P_1 = r_1x r_2y * and P_2 = r_1w * r_3z of Figure 5a. Assume that MinPts = 4 and ε = 1.5. The two patterns have a common first non-* position and MBR(r_1x) overlaps MBR(r_1w). Therefore, a candidate 3-length pattern P_cand is generated. During candidate pruning, we verify that there is a 2-length pattern with non-* positions 2 and 3 which is in L_2. Indeed, such a pattern can be spotted in the figure (see the dashed lines). After joining the segment-ids of P_1 and P_2 at line 9 of STPMine1, P_cand contains the trajectories shown in Figure 5b. Notice that the locations of the segment-ids in the intersection may no longer form clusters at some positions of P_cand. This is why we have to call validate_pattern, in order to identify the valid patterns included in P_cand. Observe that the segment-id corresponding to the lowermost location of the first position is eliminated from the cluster as an outlier. Then, while clustering at position 2, we identify two dense clusters, which define the final patterns r_1a r_2b r_3c and r_1d r_2e r_3f.

[Figure 5: Example of STPMine1. (a) 2-length patterns; (b) generated 3-length patterns.]

3.2.3 A two-phase, top-down algorithm

Although the algorithm of Figure 4 can find all partial periodic patterns correctly, it can be very slow due to the huge number of region combinations to be joined. If the actual patterns are long, all their subpatterns have to be computed and validated. In addition, a potentially huge number of candidates need to be checked and evaluated. In this section, we propose a top-down method that can discover long patterns more efficiently.

After applying clustering on each R_i (as described in Section 3.2.1), we have discovered the frequent 1-patterns with their segment-ids. The first phase of the STPMine2 algorithm replaces each location in S with the cluster-id it belongs to, or with an 'empty' value (e.g., *) if the location belongs to no cluster. For example, assume that we have discovered clusters {r_11, r_12} at position 1, {r_21} at position 2, and {r_31, r_32} at position 3. A segment {l_1, l_2, l_3}, such that l_1 ∈ r_12, l_2 ∉ r_21, and l_3 ∈ r_31, is transformed to subsequence {r_12 * r_31}. Therefore, the original spatiotemporal sequence S is transformed to a symbol-sequence S'.

Now, we could use the mining algorithm of [5] to discover fast all frequent patterns of the form r_0 r_1 ... r_{T-1}, where each r_i is a cluster in R_i or *. However, we do not know whether the results of the sequence-based algorithm are actual patterns, since the contents of each non-* position may not form a cluster. For example, {r_12 * r_31} may be frequent; however, if we consider only the segment-ids that qualify this pattern, r_12 may no longer be a cluster, or may form different actual clusters (as illustrated in Figure 5). We call the patterns P' which can be discovered by the algorithm of [5] pseudopatterns, since they may not be valid.

To discover the actual patterns, we apply some changes to the original algorithm of [5]. While creating the max-subpattern tree, we store with each tree node the segment-ids that correspond to the pseudopattern of the node after the transformation. In this way, one segment-id goes to exactly one node of the tree. However, S' could be too large to manage in memory. In order to alleviate this problem, while scanning S', for every segment s we encounter we perform the following operations.

- First, we insert the segment into the max-subpattern tree, as in [5], increasing the counter of the candidate pseudopattern P' that s corresponds to after the transformation. An example of such a tree is shown in Figure 6. This node can be found by finding the (first) maximal pseudopattern that is a superpattern of P' and following its children, recursively. If the node corresponding to P' does not exist, it is created (together with any non-existent ancestors). Notice that the dotted lines are not implemented and not followed during insertion (thus, we materialize the tree instead of a lattice). For instance, for a segment with P' = {*r_21 r_31}, we increase the counter of the corresponding node at the second level of the tree.

- Second, we insert an entry ⟨P'.id, s.sid⟩ into a file F, where P'.id is the id of the node of the lattice that corresponds to pseudopattern P' and s.sid is the id of segment s. At the end, file F is sorted on P'.id to bring together segment-ids that comply with the same (maximal) pseudopattern. For each pseudopattern with at least one segment, we insert a pointer to the file position where the first segment-id is located. Nodes of the tree are labeled in breadth-first search order, for reasons we will explain shortly.

[Figure 6: Example of a max-subpattern tree, rooted at the maximal pseudopatterns r_11 r_21 r_31, r_11 r_21 r_32, r_12 r_21 r_31, r_12 r_21 r_32, with children such as *r_21 r_31, r_11 * r_31, and r_11 r_21 *, and node pointers into the sorted segment-ids file.]

Now, instead of finding frequent patterns in a bottom-up fashion, we traverse the tree in a top-down, breadth-first order.

Explanation of the Term LSI


LSI stands for Latent Semantic Indexing, a technique for text mining and information retrieval.

By analyzing and processing a text corpus, it can help improve the accuracy and performance of search engines.

The basic principle of LSI is to convert texts into high-dimensional mathematical vector representations and to compare the similarity between texts in this vector space.

LSI first constructs a term-document matrix, in which each row represents a document and each column a term; the entries of the matrix encode the weight of each term in each document.

This matrix is then factorized using singular value decomposition to obtain the latent semantics of the documents.

By reducing dimensionality and suppressing noise, LSI can reveal semantic relationships between texts and thereby improve the quality of search-engine results.
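As an illustration of this pipeline, the sketch below builds an LSI model with the gensim library and queries it in the latent space; using gensim is an assumption of this example (any truncated-SVD implementation would do), and the toy corpus and num_topics=2 are arbitrary demonstration choices.

```python
from gensim import corpora, models, similarities

# A toy corpus: each document is a list of tokens.
docs = [["user", "interface", "system"],
        ["system", "human", "interface"],
        ["graph", "trees", "minors"],
        ["graph", "minors", "survey"]]

dictionary = corpora.Dictionary(docs)              # term vocabulary
corpus = [dictionary.doc2bow(d) for d in docs]     # term-document weights
tfidf = models.TfidfModel(corpus)                  # reweight raw counts
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary,
                      num_topics=2)                # truncated SVD of the matrix

# Query in the latent space: semantically similar documents rank high
# even without exact keyword overlap.
index = similarities.MatrixSimilarity(lsi[tfidf[corpus]])
query = dictionary.doc2bow("human system".split())
print(sorted(enumerate(index[lsi[tfidf[query]]]), key=lambda x: -x[1]))
```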

LSI can be used in a variety of text-related applications, including information retrieval, text clustering, and text classification.

In information retrieval, LSI addresses problems that plague traditional keyword matching, such as synonyms, polysemous words, and poorly relevant results.

Drawing on the semantic information in the text, it can expand and refine queries to deliver more accurate and relevant search results.

For text clustering and classification, LSI groups similar texts together, helping users understand and organize large volumes of textual information.

In summary, LSI, as a latent semantic indexing technique, performs semantic analysis and modeling of text to improve the accuracy and performance of search engines and the effectiveness of text-related applications.

PACS Systems: Transmission of Medical Images

Enabling developments: computer LAN technology matured and network speeds increased; image database technology matured; high-resolution monitors became available.
CD-R
DVD-R
MO (magneto-optical disk)
MO drive
Disk arrays
Maturation of network technology
Maturation of database technology
[Diagram: image sources feeding a PACS. CT scanners and X-ray machines connect to image-acquisition workstations; endoscopes connect to video-capture workstations.]
PACS architecture
[Diagram: PACS network architecture. The Internet connects through a router to a Gigabit Ethernet switch and Fast (100 Mbps) Ethernet switches; a database server is attached over optical fiber cable; display workstations 1 through n hang off the switches.]
Processes of the PACS database server and their interrelationships

[Diagram: process structure of the PACS database server, including a send process serving the image-acquisition workstations.]
Direct interface mode
Implemented through an interface card, for example the SCSI card of a film scanner, the video-capture card of a B-mode ultrasound scanner, or the video-capture card of a CT scanner.
Simple to connect, with a high data throughput rate, but poorly suited to secondary development.
Transmission network
Design considerations
The location and function of each node
The frequency of information passing between any two nodes, and the cost of transmission between different nodes
The reliability requirements of the communication and the required throughput
The network topology and the capacity of the communication lines
Medical-grade displays
Components of a PACS system
Image acquisition; database management; online storage; offline archiving; image display and processing; interfaces to external information systems; film printing; a high-speed local area network; and a wide area network supporting remote data transmission.

Fundamentals of Artificial Intelligence (Exercise Set 62)


Part 1: single-choice questions, 50 in total; each question has exactly one correct answer, and selecting more or fewer than one option scores no points.

1. [Single choice] Which of the following statements is correct? A) If a machine learning model achieves high accuracy, the classifier must be good. B) Increasing model complexity does not necessarily reduce the model's test error. C) Increasing model complexity always reduces the model's training error. Answer: C. Explanation: high accuracy alone does not show that a classifier is good; when predicting on an imbalanced dataset, accuracy does not reflect a model's performance. The more complex a model is, the more easily it performs well on the training set and the more easily it performs poorly on the test set.

2. [Single choice] Which statement about convolutional layers is wrong? A) The size of a convolution kernel is specified manually. B) The parameter values of a convolution kernel are specified manually. C) A convolutional layer can serve as a hidden layer of a neural network. D) The feature map is the final output of a convolutional layer. Answer: B.

3. [Single choice] There are two sample points: the first is a positive sample with feature vector (0, -1); the second is a negative sample with feature vector (2, 3). The equation of the separating line of a linear SVM classifier built from this two-point training set is ( ). A) 2x + y = 4 B) x + 2y = 5 C) x + 2y = 3 D) 2x - y = 0. Answer: C. Explanation: for two points, the maximum-margin separator is the perpendicular bisector of the segment joining them, so it suffices to find that bisector. Its slope is the negative reciprocal of the slope of the line connecting the two points: -1/((-1-3)/(0-2)) = -1/2, giving y = -(1/2)x + c. The bisector passes through the midpoint ((0+2)/2, (-1+3)/2) = (1, 1), so c = 3/2 and the equation is x + 2y = 3.
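A quick numerical check of this solution, as a sketch in Python with numpy (not part of the original exercise): for two points the maximum-margin hyperplane is the perpendicular bisector, which can be computed directly.

```python
import numpy as np

pos, neg = np.array([0.0, -1.0]), np.array([2.0, 3.0])
w = neg - pos                    # normal of the bisector: (2, 4), proportional to (1, 2)
mid = (pos + neg) / 2            # midpoint (1, 1)
b = -w @ mid                     # hyperplane w.x + b = 0 through the midpoint

print(w / 2, b / 2)              # -> [1. 2.] -3.0, i.e. x + 2y - 3 = 0 (option C)
# Opposite signs of equal magnitude confirm the two points are
# separated by the plane at equal distance (maximum margin).
print(w @ pos + b, w @ neg + b)  # -> -10.0 and 10.0
```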

4.[单选题]在具体求解中,能够利用与该问题有关的信息来简化搜索过程,称此类信息为( )A)启发信息B)简化信息C)搜索信息D)求解信息答案:A解析:5.[单选题]下列哪个不是RPA实施回报率的评估因素?()A)成本节省B)生产力提升C)质量改进D)劳动力需求有规律答案:DA)人机交互系统B)机器人-环境交互系统C)驱动系统D)控制系统答案:A解析:7.[单选题]下面不属于人工智能研究基本内容的是()A)机器感知B)机器思维C)机器学习D)自动化答案:D解析:8.[单选题]大数据正快速发展为对数量巨大、来源分散、格式多样的数据进行采集、存储和关联分析,从中发现新知识、创造新价值、提升新能力的()A)新一代技术平台B)新一代信息技术和服务业态C)新一代服务业态D)新一代信息技术答案:B解析:9.[单选题]梯度下降算法中,损失函数曲面上轨迹最混乱的算法是以下哪种算法?A)SGDB)BGDC)MGDD)MBGD答案:A解析:10.[单选题]当不知道数据所带标签时,可以使用哪种技术促使带同类标签的数据与带其他标签的数据相分离?()A)分类B)聚类C)关联分析D)隐马尔可夫链答案:B解析:11.[单选题]线性判别分析常被视为一种经典的()技术。

The coherencemodel() computation algorithm in the gensim library

1. Introduction

1.1 Overview

This article introduces the coherencemodel() computation algorithm in the gensim library. gensim is a Python library for topic modeling and document similarity comparison; it provides a rich set of functions and tools to help researchers and developers handle natural language processing tasks. Among them, coherencemodel() is an important facility of the gensim library, used to evaluate the coherence of topic models.

1.2 Structure of this article

This article is organized in five parts. First, the introduction gives an overview and lays out the structure of the article. The second part introduces the gensim library and the function and role of coherencemodel() in detail. The third part examines how the coherencemodel() algorithm is implemented and how its parameters can be tuned. The fourth part demonstrates the practical value of gensim's coherencemodel() in real projects through application scenarios and case studies. Finally, the conclusion offers an assessment of the coherencemodel() algorithm and looks ahead to its future development and applications.

1.3 Purpose

This article aims to give readers a thorough understanding of the principles, implementation, and applications of the coherencemodel() computation algorithm in the gensim library for natural language processing tasks. By studying and mastering the coherencemodel() algorithm, readers can better evaluate the coherence of topic models and apply it to practical projects in related fields. This helps improve the quality of topic models and raises the productivity of researchers and developers working in natural language processing.

2. The coherencemodel() computation algorithm in the gensim library

2.1 A brief introduction to the gensim library

Gensim is a Python library for topic modeling and natural language processing. It provides many facilities for handling text data, among them the coherencemodel() function. Gensim is designed to process large-scale text datasets efficiently and to offer convenient tools for building and evaluating topic models.

2.2 What coherencemodel() does

The coherencemodel() function is gensim's method for computing the coherence of a topic model.

Overview of GATE Features (External)


Noun Phrase Chunker: marks noun phrases in text.

Feature overview

OntoText Gazetteer: produces results similar to those of the ANNIE Gazetteer, but with a different algorithm.
Flexible Gazetteer: the Flexible Gazetteer provides users with the flexibility to choose their own customized input and an external Gazetteer.
Gazetteer List Collector

Feature overview

RASP Parser: RASP (Robust Accurate Statistical Parsing) is a robust parsing system for English. It includes the following four PRs: RASP2 Tokenizer, RASP2 POS Tagger, RASP2 Morphological Analyser, and RASP2 Parser (creates multiple dependency annotations to represent a parse of each sentence). RASP is only supported on Linux operating systems.
SUPPLE Parser: SUPPLE is a bottom-up parser that constructs syntax trees and logical forms for English sentences. It needs a Prolog interpreter.
Stanford Parser
Similar to the standard JAPE transducer. Plugin

Expression and function of Met-tRNAi(Met) carrier proteins in acute myeloid leukemia (AML)

Basic & Clinical Medicine, April 2021, Vol. 41, No. 4. Research article.

Expression and function of Met-tRNAi(Met) carrier proteins in acute myeloid leukemia (AML)

苏鹏忠, 何家V, 于姗, 王小爽, 余佳* (State Key Laboratory of Medical Molecular Biology, Department of Biochemistry and Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and School of Basic Medicine, Peking Union Medical College, Beijing). Supported by the National Science Fund for Distinguished Young Scholars of the National Natural Science Foundation of China (81775418). *Corresponding author: j-yp@iSms.pumo.e0u.cy

Abstract. Objective: To investigate the expression of the Met-tRNAi(Met) carrier proteins eIF2A, eIF2D and MCTS1 in normal hematopoietic stem cells (HSCs) and acute myeloid leukemia (AML) cells, and to study the effect of these proteins on the proliferation of AML cells. Methods: Public sequencing datasets were used to analyze the expression of the Met-tRNAi(Met) carrier proteins at different stages of mouse HSC differentiation; a single-cell transcriptome sequencing dataset was used to analyze their expression at different stages of human HSC differentiation; and public sequencing datasets were used to analyze their expression in bone marrow cells from healthy donors and AML patients. In the human acute myeloid leukemia cell line MOLM13, lentiviral transduction was used to suppress the endogenous expression of eIF2A or eIF2D, and changes in cell proliferation and cell cycle were measured. Leukemia cell lines were driven into a stressed state by sduUrOX (10 μmol/L) treatment and by serum starvation, and the resulting changes in eIF2A and eIF2D expression were measured. Results: The expression of eIF2A and eIF2D in hematopoietic stem and progenitor cells was significantly higher than in mature blood cells, and both were expressed at higher levels in AML patients than in healthy donors. In contrast, MCTS1 showed no expression difference either during normal hematopoietic differentiation or in AML cells. After suppression of endogenous eIF2A or eIF2D, the proliferation of MOLM13 cells was significantly inhibited (P < 0.001), and the cell cycle was arrested at G2/M phase (P < 0.01).


WonderWeb Deliverable D16
Reusing semi-structured terminologies for ontology building: A realistic case study in fishery information systems
Aldo Gangemi, ISTC-CNR, email: a.gangemi@r.it

Identifier: D16. Class: Deliverable. Version: 1.0. Date: 7-05-2004. Status: public. Lead Partner: ISTC-CNR.
IST Project 2001-33052 WonderWeb: Ontology Infrastructure for the Semantic Web

WonderWeb Project
This document forms part of a research project funded by the IST Programme of the Commission of the European Communities as project number IST-2001-33052. For further information about WonderWeb, please contact the project co-ordinator: Ian Horrocks, The Victoria University of Manchester, Department of Computer Science, Kilburn Building, Oxford Road, Manchester M13 9PL. Tel: +44 161 275 6154. Fax: +44 161 275 6236. Email: wonderweb-info@

Revision Information: Version 1.0, 5-05-2004.

Table of Contents
1 Introduction
  1.1 Bootstrapping dedicated semantic webs
  1.2 A bit of history
2 The fishery case study: resources, issues, and methods
  2.1 Resources
  2.2 Some issues
  2.3 Some methods
3 KOS reengineering lifecycle
  3.1 Formatting and lifting
  3.2 Formalization, and Core ontology building
  3.3 Modularization, and alignment
  3.4 Annotation, refinement, merging
4 Post-processing lifecycle
  4.1 Services for information retrieval
  4.2 Services for distributed database querying
  4.3 Tools
5 Further discussion on the case study and its relevance to the Semantic Web
6 References

1 Introduction

1.1 Bootstrapping dedicated semantic webs

A main issue in the deployment of the Semantic Web (SW) is currently its population: very few ontologies and tagged documents exist in comparison to the huge amount of domains and documents that exist on the Web. Several strategies are being exploited to bootstrap the SW: machine learning [1,2,3], NLP techniques [4,5], semantic services [6], lifting existing metadata [7,8,9,10,11,12], etc. These strategies have different advantages according to the type of documents or domains: while machine learning and NLP techniques try to extract useful recurrent patterns out of existing (mostly free text or semi-structured) documents, and semantic services try to generate semantically indexed, structured documents e.g. out of transactions, existing metadata can be considered proto-ontologies that can be "lifted" from legacy indexing tools and indexed documents. In other words, metadata lifting ultimately tries to reengineer existing document management systems into dedicated semantic webs.(1)

Legacy information systems often use metadata contained in Knowledge Organization Systems (KOSes), such as vocabularies, taxonomies and directories, in order to manage and organize information. KOSes support document tagging (thesaurus-based indexing) and information retrieval (thesaurus-based search), but their semantic informality and heterogeneity usually prevent a satisfactory integration of the supported documentary repositories and databases. As a matter of fact, traditional techniques mainly consist of time-consuming, manual mappings that are made, each time a new source or a modification enters the lifecycle, by experts with idiosyncratic procedures. Informality and heterogeneity make them particularly hostile with reference to the SW.

This document describes the methodology used for the creation, integration and utilization of ontologies for information integration and semantic interoperability, with respect to a case study: fishery information systems. Such a case study, which is definitely not a toy example, has been the target of an institutional project carried out by CNR and UN-FAO, which exploited the DOLCE ontology and the methods developed within the WonderWeb project, as well as previous methodologies developed in the past by ITBM-CNR.(2)

We describe various methods to reengineer, align, and merge KOSes in order to build a large fishery ontology library. Some examples of semantic services based on it, either for a simple one-access portal or a sophisticated web application, are also sketched, which envisage a fishery semantic web. With respect to the main threads of WonderWeb (languages, tools, foundational ontologies, versioning, and modularity), we concentrate this section on a demonstration of KOS reengineering issues from the viewpoint of formal ontology; therefore the main threads will appear in the context of the case study description rather than as explicitly addressed topics. We assume a basic knowledge of the deliverable D18 for full comprehension of this section.

(1) Notice that the different strategies are not mutually exclusive, but can be combined. In the FOS project, we have also used techniques from NLP and semantic services.
We thank the UN-FAO WAICENT-GILW department for allowing us to reuse in this deliverable some of the FOS project documentation.

1.2 A bit of history

At the beginning of 2002 the Food and Agriculture Organization of the United Nations (FAO, in the following), based in Rome, took action in order to enhance the quality of its information and knowledge services related to fishery. The following internal agencies were asked to participate in a task force by providing manpower and/or data, information or knowledge repositories:
- The FAO Fishery Department provided the reference tables of its Internet portal, the Fishery Global Information System (FIGIS) (/figis/servlet/FiRefServlet?ds=staticXML&xml=webapps/figis/wwwroot/fi/figis/index.xml&xsl=webapps/figis/staticXML/format/webpage.xsl).
- The ASFA secretariat (/fi/asfa/asfa.asp), the managing body of the Aquatic Sciences and Fisheries Abstracts, contributed its online thesaurus for fishery.
- SIFAR, the Support unit for International Fisheries and Aquatic Research (/global/about.htm), contributed the contents and the structure of the oneFish community directory.
- FAO WAICENT, the World Agricultural Information Centre (/WAICENT/), provided access, through its office for General Information Systems and Digital Libraries (GILW), to the fishery part of the AGROVOC thesaurus.

FOS naturally fitted the wider AOS (Agriculture Ontology Service) long-term programme (/agris/aos), started by FAO at the end of 2001, of which FOS constitutes one major case study (together with the Food Safety project [12], and others). The scientific coordination and supervision of the FOS project was assigned to the Laboratory for Applied Ontology of the Institute of Cognitive Sciences and Technologies of the Italian National Research Council (LOA, in the following; http://www.loa-cnr.it). The outline of the project and the preliminary methods have already been presented in [13]. Here we describe some salient aspects of the FOS project after the completion of its first phase (2002-2003), which show the principles (and their applicability) that can be adopted when reengineering semi-structured KOSes into formal ontologies, in the formats and with the tools envisaged by the WonderWeb project.

Section 2 describes the sources that were subject to reengineering, integration, alignment, and merging, and the general issues and principles. Section 3 presents the methodology in more detail, outlines the global results, and provides some examples of the interoperability between the sources that was achieved. Finally, section 4 draws some conclusions.

2 The fishery case study: resources, issues, and methods

2.1 Resources

The following resources have been singled out from the fishery information systems considered:

OneFish topic trees. OneFish [14] is a portal for fishery activities and a participatory resource gateway for the fisheries and aquatic research and development sector. It contains heterogeneous data, organized through hierarchical topic trees (more than 1,800 topics, increasing regularly), made up of hierarchical topics with brief summaries, identity codes and attached knowledge objects (documents, web sites, various metadata).
The hierarchy (average depth: 3) is ordered by (at least) two different relations: subtopic, and intersection between topics, the latter being notated with @, similarly to relations found in well-known subject directories like DMOZ. There is one 'backbone' tree consisting of five disjoint categories, called worldviews (subjects, ecosystem, geography, species, administration), and one worldview (stakeholder), maintained by the users of the community, containing its own topics and topics that are also contained in the first four categories (Figure 5). Alternative trees contain new 'conjunct' topics deriving from the intersection of topics belonging to different categories.

AGROVOC thesaurus. AGROVOC [15] has been developed by FAO and the Commission of the European Communities in the early 1980s and is used for document indexing and retrieval. It is a multilingual, structured and controlled vocabulary designed to cover the terminology of all subject fields of agriculture, forestry, fisheries, food and related domains (e.g. environment), in order to describe documents in a controlled language system. Different hierarchical and associative relations (broader/narrower terms, related terms, equivalent terms, used for) are established between the terms. AGROVOC contains approximately 2,000 fishery-related descriptors out of about 16,000 descriptors.

ASFA thesaurus. ASFA [16] is an abstracting and indexing service covering the world's literature on the science, technology, management, and conservation of marine, brackishwater, and freshwater resources and environments, including their socio-economic and legal aspects. The thesaurus is an online service which provides terminological definitions in terms of various relations, e.g. narrower term, related term, used for. It consists of more than 6,000 descriptors.

FIGIS reference tables. FIGIS [17] is a global network of integrated fisheries information. Presently its thematic sections are five: aquatic species (i.e. biological information); geographic objects (water and continental areas, political geographic entities); marine resources (information on the state of world resources, data on regional fish stocks, major issues affecting stocks); marine fisheries (data and maps on the exploitation of the major species, management-related information); and fishing technologies (information on high-seas vessel identification, on the selection of technologies, and on training and international legal issues). The FIGIS reference tables comprise all the contents of this huge database. The reference tables consist of approximately 200 top-level concepts, with a maximum depth of 4, 30,000 'objects' (mixed concepts and individuals), relations (specialized for each top category, but scarcely instantiated), and multilingual support.

FIGIS DTDs. Some XML Document Type Definitions (now moving to RDFS) are also maintained by FIGIS to organize their databases. The original set included 823 elements with a rich attribute structure. Those related to fishery ontologies have been taken into account.
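Despite their differences, the four sources share a broadly similar record shape: a preferred term or code, hierarchical links, associative links, and (sometimes) multilingual equivalents. A minimal sketch of a uniform record structure for the lifted entries, in Python (the names are illustrative, not the actual FOS schema):

```python
from dataclasses import dataclass, field

@dataclass
class LiftedRecord:
    """Uniform view of one entry lifted from a legacy KOS
    (thesaurus, topic tree, or reference table)."""
    source: str                                   # e.g. "AGROVOC", "ASFA", "FIGIS", "OneFish"
    code: str                                     # identity code or preferred term used as key
    labels: dict = field(default_factory=dict)    # language -> equivalent terms
    broader: list = field(default_factory=list)   # BT links / parent topics
    narrower: list = field(default_factory=list)  # NT links / subtopics
    related: list = field(default_factory=list)   # RT links, '@' intersections, FIGIS local relations

# e.g. the AGROVOC "aquaculture" entry shown in Table 1 below:
aquaculture = LiftedRecord(
    source="AGROVOC", code="AQUACULTURE",
    labels={"en": ["aquaculture"], "fr": ["aquaculture"], "es": ["acuicultura"]},
    narrower=["fish culture", "frog culture"],
    related=["agripisciculture", "aquaculture equipment"],
)
```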
2.2 Some issues

As mentioned in the introduction, the sources to be integrated were rather varied from many perspectives (semantic, lexical and structural).

Table 1. Sample aquaculture descriptors in the four resources (NT means narrower term; rt means related term; Fr and Es are the corresponding French and Spanish terms):
- AQUACULTURE (AGROVOC): NT1 fish culture; NT2 fish feeding; NT1 frog culture; …; rt agripisciculture; rt aquaculture equipment; …; Fr aquaculture; Es acuicultura
- AQUACULTURE (ASFA): NT Brackishwater aquaculture; NT Freshwater aquaculture; NT Marine aquaculture; rt Aquaculture development; rt Aquaculture economics; rt Aquaculture engineering; rt Aquaculture facilities
- Biological entity (FIGIS): Taxonomic entity: Major group, Order, Family, Genus, Species; Capture species (filter); Aquaculture species (filter); Production species (filter); Tuna atlas spec
- SUBJECT (OneFish): Aquaculture; Aquaculture development; Aquaculture economics @; Aquaculture planning

An example of how formal ontologies can be relevant to fishery information services is given by the information someone interested in aquaculture could get (Table 1). In fact, beyond simple keyword-based searching, searches based on tagged content or sophisticated natural-language techniques require some conceptual structuring of the linguistic content of texts. The four systems concerned by this case study provide this structure in very different ways and with different conceptual "textures". For example (Table 1), the AGROVOC and ASFA thesauri put aquaculture in the context of different thesaurus hierarchies. The AGROVOC thesaurus seems to conceptualize aquaculture types from the viewpoint of techniques and species. The ASFA aquaculture hierarchy is substantially different, since it seems to stress the environment and the disciplines related to aquaculture. A different resource is constituted by the so-called reference tables in the FIGIS system; the only reference table mentioning aquaculture puts it into yet another context (taxonomical species). The last resource examined is the oneFish directory, which returns a context related to economics and planning. With such different interpretations of aquaculture, we can reasonably expect different search and indexing results.

Nevertheless, our approach to information integration and ontology building is not that of creating a homogeneous system in the sense of a reduced freedom of interpretation, but in the sense of navigating alternative interpretations, querying alternative systems, and conceiving alternative contexts of use. Once it is clear that different fishery information systems provide different views on the domain, we directly enter the paradigm of ontology integration, namely the integration of schemas that are arbitrary logical theories, and hence can have multiple models (as opposed to database schemas, which have only one model) [19]. As a matter of fact, the thesauri, topic trees and reference tables used in the systems to be integrated can be considered informal schemata conceived in order to query semi-structured or informal databases such as texts, forms and tagged documents. In order to benefit from the ontology integration framework, we must transform informal schemata into formal ones. In other words, thesauri and other terminology management resources must be transformed into formal ontologies.
In order to do this, we require a comprehensive set of ontologies designed in a way that admits the existence of many possible pathways among concepts under a common conceptual framework. In our opinion, the framework should:
- reuse domain-independent ontologies shared by the resources, in order to make the different components interoperate;
- be flexible enough that different views have a common context;
- be focused on the core reasoning schemata for the fishery domain, otherwise the common conceptual framework would be too abstract.

Domain-independent, foundational ontologies [18] characterise the general notions needed to talk about economics, biological species, or fish production techniques; for example: parts, agents, attributes, aggregates, activities, plans, devices, species, regions of space or time, etc. Furthermore, so-called core ontologies [18] characterise the main conceptual schemata that the members of the fishery community use to reason, e.g. that certain plans govern certain procedures involving certain devices applied to activities like capturing fish of a certain species in certain areas of water regions, etc. Foundational and core ontologies provide the framework to integrate, in a meaningful and intersubjective way, different views on the same domain, such as those represented by the queries that can be posed to a set of distributed information systems containing (un)structured data.

2.3 Some methods

In order to perform this reengineering task, we have applied the techniques of three methodologies: application of the DOLCE foundational principles introduced in WonderWeb D18 [18], ONIONS [20], and OnTopic [21].

WonderWeb D18 contains principles for building and using foundational ontologies for core and domain ontology analysis, revision, and development. DOLCE is an axiomatic, domain-independent theory based on formal principles.

ONIONS is a set of methods for reengineering (in)formal domain metadata – such as glossaries, terminologies, data models, conceptual schemata, business models, etc. – to the status of formal ontology data types, for integrating them in a common formal structure, for aligning them to a foundational ontology, and for merging them. Some methods are aimed at reusing the structure of hierarchies (e.g. BT/NT relations, the subtopic relation, etc.) and the additional relations that can be found (e.g. RT relations), and at analysing the compositional structure of terms in order to capture new relations and definitional elements. Other methods concern the management of semantic mismatches between alternative or overlapping ontologies, and the exploitation of systematic polysemy to discover relevant domain conceptual structures.

OnTopic is about creating dependencies between topic hierarchies and ontologies. It contains methods for deriving the elements of an ontology that describe a given topic, and methods to build "active" topics that are defined according to the dependencies of any individual, concept, or relation in an ontology.
OnTopic has only suggested design decisions in the case study.

In section 3, we describe these methods as used in the KOS reengineering lifecycle and the types of data extracted from the fishery resources, with examples of their porting, translation, transformation, and refinement. In section 4 we finally give a resume of the tools tested and/or endorsed in the case study.

3 KOS reengineering lifecycle

In Figs. 1, 3, 6, 8 and 9, a UML "activity diagram" is shown that summarizes the main steps of the methods we have followed to create the Fishery Ontology Library (FishOL). For the sake of readability, we have split the activity diagram into five pieces, as follows:
1) Terminological database (TDB) formatting and schema lifting
2) TDB porting, formalization, and Core ontology building
3) Modularization, ontology library building, and alignment to reference ontologies
4) Annotation, refinement, and merging of the library
5) Measures for finalisation, maintenance, and exploitation

3.1 Formatting and lifting

In the first phase of the lifecycle (Fig. 1), the original terminological databases are imported into a common database format. The conceptual schemata of the databases are lifted (either manually, or by using automatic reverse engineering components [11]). At the same time, a common Ontology Data Model (ODM) should be chosen. This can be partly derived from the semantics of ontology representation languages (e.g. the OWL ODM [22]), enhanced with criteria for distinguishing the different data types at the ontological level (e.g. individual, class, meta-property, relation, property name, lexicon, etc.). Ontologically explicit ODMs are described in [23,24]. With the help of the ODM, lifted schemata can be translated and then integrated (the integration methodology assumed is [19]).

In FOS, the original TDBs turned out to be syntactically heterogeneous, especially FIGIS with respect to ASFA and AGROVOC. In fact, the first is controlled through a set of XML DTDs (currently moving to RDFS), while the latter two are implemented in relational databases with one basic relational table.

Semantically, the TDB schemata are even more heterogeneous (see Table 1 for examples). ASFA is a typical thesaurus, made up of descriptors (equivalence classes of terms with the same assumed meaning), equivalent terms, and relations among descriptors (BT, NT, RT, UF) that create a forest structure (a directed acyclic graph [25]). Descriptors are encoded via a "preferred" term. AGROVOC is also a thesaurus, but contains multilingual equivalent terms, and descriptors are encoded via alphanumeric codes. FIGIS is not a thesaurus, but a collection of TDBs organised into modules containing different domain terminologies, e.g. vessels, organisms, techniques, institutions, etc. Equivalence classes of multilingual terms are defined (similar to thesaurus descriptors), and each equivalence class has an identification code. Each module has a peculiar schema including local relations defined on (classes of) classes of terminological equivalence classes – e.g. a relation between institutions and countries, a relation between vessels and techniques, one between organism species and genera, etc. (Figure 1 shows the UML activity diagram for formatting and lifting activities.) These relations are more informative than generic RT thesaurus relations (see phase 2 about additional transforms to TDBs).

FIGIS DTDs encode heterogeneous metadata for the management of the FIGIS database. These XML elements can refer to domain-specific information (e.g. "Location"), datatypes (e.g. "Date"), data about data (e.g. "Available"), or foreign keys (e.g. "AqSpecies_Text").

Finally, OneFish is a tree structure of subjects (keywords used to classify documents) with multihierarchical links, similar to Web directories like DMOZ [26]. The top subjects in OneFish are depicted in Fig. 2. (Fig. 2. Topic spaces ("worldviews") in oneFish: administration, subjects, ecosystem, stakeholders, geography, species.)

The integrated schema turns out to include all the data types from the TDBs. On the other hand, we needed to interpret the original data types into an (onto)logically valid integrated schema. Therefore, we have created a mapping from each (domain-related) legacy data type to an ODM data type: e.g. owl:Class, to which "descriptor" and "FIGIS equivalence class" have been mapped; owl:ObjectProperty, to which "RT" and most FIGIS relations have been mapped (as instances); topic, to which the OneFish "subject" has been mapped; etc. As explained below, some adjustments to the original TDBs are needed in order to preserve a correct semantics when translating some elements to the integrated schema.

3.2 Formalization, and Core ontology building

After a common format and an integrated ontology data model have been obtained, the second phase (Fig. 3) starts by choosing an Ontology Representation Language (ORL). In FOS, some tests were performed at the beginning of the project, and we decided to take a layered approach, maintaining the TDBs in different ontology repositories represented in languages of increasing expressivity: RDF(S) [27] has been chosen for the basic layer, DAML+OIL [28] (currently OWL-DL [22]) for the middle layer, and KIF [29] for the expressive layer. The reasons for such layering reside in a) the necessity of carrying out certain ontology learning procedures (see phase 4) with the expressive version, b) the necessity of using the standard Semantic Web ontology languages to carry out inferences with the middle layer, and c) the necessity of maintaining a lightweight ontology with the basic layer.

RDF(S) can also be used to import the original TDBs without using the ODM. In fact, a preliminary decision was required about how the ontologies obtained from the TDBs should be used. (A similar problem is discussed in the W3C SW Best Practices and Deployment Working Group with respect to wordnets and thesauri [30].)

The first choice has been to preserve the TDB elements in their original data models. In this case, no mapping is performed from the original data models to the ODM, and only an integrated (non-refined) data model is used. The advantage of this choice is that no interpretation is performed on the legacy TDBs, but there are two disadvantages: translated TDBs are not (proto-)ontologies, but plain RDF models, hence no ontology inferencing can be made using them; and imported TDBs cannot be aligned or merged, but only integrated.

The second choice has been to translate the TDBs according to the ODM, interpreting and mapping the original data models, and making the refinements needed to preserve the semantics of the ODM. This solution overcomes the disadvantages of the first choice at the cost of making interpretations. In FOS the maintainers of the legacy TDBs are members of the task force, so we can expect that interpretations are not harmful. In other contexts – especially if experts are not collaborating – interpretations may be more problematic. (Figure 3 shows the activity diagram for metadata formalization and Core ontology building.)

In the case study, the first choice has been easily produced through a rather economic procedure.
Most efforts have then been put into translating, and sometimes transforming, the TDBs into proto- and then full-fledged ontologies. In particular, a translation to ODM data types has been performed. For certain terminological data types, a refinement is performed at this stage and after alignment (see phase 3).

For example, AGROVOC makes no difference between descriptors denoting owl:Classes (e.g. agrovoc:River) and descriptors denoting owl:Individuals (e.g. agrovoc:Amazon). Most individuals have been found in subdomains like geography and institutions. Another example concerns thesaurus relations: while RT (Related Term) needs no refinement with respect to the ODM – it is imported as an owl:ObjectProperty holding between individuals (and defined on classes) – and UF is an owl:DatatypeProperty holding between lexical items (strings), BT (Broader Term), on the contrary, is usually the rdfs:subClassOf property, but is sometimes used as a "part of" owl:ObjectProperty.

Translation and refinement have been complemented by transforming the applications of RT, and of the owl:ObjectProperties from FIGIS, into formal owl:Restrictions. The working hypotheses in making these transformations are that:
- the resulting owl:Restrictions are inherited by all the subclasses of the rdfs:Class to which the restriction pertains, and
- the quantification applicable to restrictions is owl:someValuesFrom.
Both hypotheses are confirmed in most FOS cases. E.g. in AGROVOC, from the original record:

<Fishing vessel> <RT> <Fishing gear>

it is semantically correct to derive the following transform (we use the OWL abstract syntax [31] for most examples in this section of the deliverable):

Class(agrovoc:Fishing_vessel partial
  (restriction(agrovoc:RT someValuesFrom(agrovoc:Fishing_gear))))

In phase 4 we explain that RT restrictions can be refined in order to make their intended meaning more precise.

A concurrent task has been performed during the translation and transformation phase, which provides the means to fulfil the tasks in phase 3. This task is the construction of a Core Ontology – in this case study, a Core Ontology of Fishery (COF). For the many theoretical underpinnings of core ontology construction that come from modularization and reuse with respect to foundational ontologies, we refer to [18]. As an example, we only provide here a basic description of COF and of the reusable reference ontologies that have been employed. COF has been designed by specializing the DOLCE-Lite-Plus ontology ("DOLCE+" in the following; Fig. 4 shows the DOLCE+ top level, i.e. its most general classes) [18], developed within the WonderWeb project.
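For concreteness, a sketch of the RT-to-restriction transform described above, as the triple-level (rdflib) encoding of the abstract-syntax axiom shown earlier; the namespace URI is illustrative:

```python
from rdflib import BNode, Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

AGROVOC = Namespace("http://example.org/agrovoc#")  # illustrative URI

def rt_to_restriction(g, subject_cls, filler_cls):
    """Turn <subject> <RT> <filler> into
    Class(subject partial restriction(RT someValuesFrom(filler)))."""
    r = BNode()
    g.add((r, RDF.type, OWL.Restriction))
    g.add((r, OWL.onProperty, AGROVOC.RT))
    g.add((r, OWL.someValuesFrom, filler_cls))
    # 'partial' means a necessary condition, i.e. an rdfs:subClassOf axiom,
    # so the restriction is inherited by all subclasses of the subject class
    g.add((subject_cls, RDFS.subClassOf, r))

g = Graph()
rt_to_restriction(g, AGROVOC.Fishing_vessel, AGROVOC.Fishing_gear)
```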


Semantic Annotation, Indexing, and Retrieval

Atanas Kiryakov, Borislav Popov, Damyan Ognyanoff, Dimitar Manov, Angel Kirilov, Miroslav Goranov
Ontotext Lab, Sirma AI EOOD, 138 Tsarigradsko Shose, Sofia 1784, Bulgaria
{naso, borislav, damyan, mitac, angel, miro}@sirma.bg

Abstract. The Semantic Web realization depends on the availability of a critical mass of metadata for the web content, linked to formal knowledge about the world. This paper presents our vision of a holistic system allowing annotation, indexing, and retrieval of documents with respect to real-world entities. A system (called KIM) partially implementing this concept is shortly presented and used for evaluation and demonstration. Our understanding is that a system for semantic annotation should be based upon specific knowledge about the world, rather than being indifferent to any ontological commitments and general knowledge. To assure efficiency and reusability of the metadata we introduce a simplistic upper-level ontology which starts with some basic philosophic distinctions and goes down to the most popular entity types (people, companies, cities, etc.), thus providing many of the inter-domain common-sense concepts and allowing easy domain-specific extensions. Based on the ontology, an extensive knowledge base of entity descriptions is maintained. A semantically enhanced information extraction system providing automatic annotation with references to classes in the ontology and to instances in the knowledge base is presented. Based on these annotations, we perform IR-like indexing and retrieval, further extended using the ontology and knowledge about the specific entities.

1 Introduction

The Semantic Web is about adding formal semantics (metadata, knowledge) to the web content for the purpose of more efficient access and management. Since its vitality depends on the presence of a critical mass of metadata, the acquisition of this metadata is a major challenge for the Semantic Web community. Though in some cases unavoidable, the manual accumulation of this explicit semantics is not considered a feasible approach. Our vision is that fully automatic methods for semantic annotation should be researched and developed. For this to happen, the necessary design and modeling questions should be faced and resolved, and the enabling complementary resources and infrastructure should be provided. To assure wide acceptance and usage of semantic annotation systems, their tasks should be clearly defined and their performance properly evaluated and communicated.

The semantic annotation offered here is a specific metadata generation and usage schema targeted to enable new information access methods and extend existing ones. The annotation scheme offered is based on the understanding that the named entities (NE, see 1.1) mentioned in the documents constitute an important part of their semantics. Further, using different sorts of redundancy and external or background knowledge, those entities can be coupled with formal descriptions and thus provide more semantics and connectivity to the web. We hope that the expectations towards the Semantic Web will be easier to realize if the following basic tasks can be defined and solved:
1. Annotate and hyperlink (references to) named entities in text documents;
2. Index and retrieve documents with respect to the referred entities.

The first task can be seen as an advanced combination of a basic press-clipping exercise, a typical IE task (information extraction, a relatively young discipline in Natural Language Processing (NLP), conducts partial analysis of text in order to extract specific information [6]), and automatic hyperlinking.
The resulting annotations basically represent a document enrichment and presentation method, which can further be used to enable other access methods.

The second task is just a modification of the classical IR task – documents are retrieved based on relevance to NEs instead of words. However, the basic assumption is quite similar – the documents are characterized by the bag of tokens (or "atomic text entities", as those are referred to in [17]) constituting their content, disregarding its structure. While the basic IR approach considers the word stems as tokens, for the last decade there has been considerable effort towards using word senses or lexical concepts (see [20] and [36]) for indexing and retrieval. The named entities can be seen as a special sort of token to be taken care of. What we present here is one more (pretty much independent) development direction, rather than an alternative to the contemporary IR trends.

Fig. 1. Semantic Annotation

In a nutshell, Semantic Annotation is about assigning to the entities in the text links to their semantic descriptions (as presented in Fig. 1). This sort of metadata provides both class and instance information about the entities. It is a matter of terminology whether these annotations should be called "semantic", "entity", or some other way. To the best of our knowledge there is no well-established term for this task; neither is there a well-established meaning for "semantic annotation". What is more important, the automatic semantic annotations enable many new applications: highlighting, indexing and retrieval, categorization, generation of more advanced metadata, smooth traversal between unstructured text and available relevant knowledge. Semantic annotation is applicable to any sort of text – web pages, regular (non-web) documents, text fields in databases, etc. Further, knowledge acquisition can be performed based on extraction of more complex dependencies – analysis of relationships between entities, event and situation descriptions, etc.

This paper presents a schema for automatic semantic annotation, indexing and retrieval, together with a discussion of a number of design and modeling questions (section 2), followed by a discussion of the process (section 3). In section 4 we present a software platform, KIM, which demonstrates this model based on the latest Semantic Web and Information Extraction technology. The fifth section provides a survey of related work. Conclusion and future work are discussed in section 6.

1.1 Named Entities

In the NLP and particularly the IE tradition, named entities are considered to be: people, organizations, locations, and others referred to by name. In a wider interpretation, those also include scalar values (numbers, dates, amounts of money), addresses, etc. The NEs require different handling because of their different nature and semantics (without trying to discuss what "semantic" means in general, we simplify it down to "a model or description of an object which allows further interpretation") as opposed to the words (terms, phrases, etc.). While the former denote particulars (individuals or instances), the latter denote universals (concepts, classes, relations, attributes). While the words can be described with the means of lexical semantics and common sense, the understanding and managing of named entities requires more specific world knowledge.

2 Semantic Annotation Model and Representation

Here we discuss the structure and the representation of the semantic annotations, including the necessary knowledge and metadata.
There are a number of basic prerequisites for the representation of semantic annotations:
- an ontology (or at least a taxonomy) defining the entity classes; it should be possible to refer to those classes;
- entity identifiers, which allow entities to be distinguished and linked to their semantic descriptions;
- a knowledge base with entity descriptions.

The next question concerns an important choice for the representation of the annotations: "to embed or not to embed?" Although the embedded annotations seem easier to maintain, there are a number of arguments providing evidence that the semantic annotations have to be decoupled from the content they refer to. One key reason is to allow dynamic, user-specific semantic annotations – embedded annotations become part of the content and may not change corresponding to the interest of the user or the context of usage. Further, embedded complex annotations would have a negative impact on the volume of the content and can complicate its maintenance – imagine that a page with three layers of overlapping semantic annotations needs to be updated while keeping them consistent. Those and a number of other issues defending the externally encoded annotations can be found in [34], which also provides an interesting parallel to the open hypermedia systems.

Once it is decided that the semantic annotations have to be kept separate from the content, the next question is whether or not (or how much) to couple the annotations with the ontology and the knowledge base. Such integration seems profitable – it would be easier to keep the annotations in sync with the class and entity descriptions. However, there are at least two important problems:
- Both the cardinality and the complexity of the annotations differ from those of the entity descriptions – the annotations are simpler, but their count is usually much bigger than that of the entity descriptions. Even for middle-sized document corpora the annotations can reach tens of millions. Suppose 10M annotations are stored in an RDF(S) store together with 1M entity descriptions, and suppose also that each annotation and each entity description is represented with 10 statements. There is a considerable difference between the inference approaches and hardware capable of efficient reasoning and access for a 10M-statement repository and for a 110M-statement repository.
- It would be nice if the world knowledge (ontology and instance data) and the document-related metadata were kept independent. This would mean that, for one and the same document, different extraction, processing, or authoring methods would be able to deliver alternative metadata referring to one and the same knowledge store.
- Most important, it should be possible for the ownership of and responsibility for the metadata and the knowledge to be distributed. This way, different parties can develop and maintain separately the content, the metadata, and the knowledge.

Based on the above arguments, we propose decoupled representation and management of the documents, the metadata (annotations), and the formal knowledge (ontologies and instance data), as depicted in Fig. 2 (Fig. 2. Distributed Heterogeneous Knowledge).
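A minimal sketch of such a decoupled (standoff) annotation record – a document identifier plus character offsets, with the class and instance links pointing into the ontology and KB rather than into the content. The URIs and field names are illustrative, not KIM's actual format:

```python
from dataclasses import dataclass

@dataclass
class SemanticAnnotation:
    """Standoff annotation: kept outside the document, linked to it by URI and offsets."""
    doc_uri: str       # identifies the annotated content
    start: int         # character offsets of the entity reference in the text
    end: int
    class_uri: str     # most specific entity class in the ontology
    instance_uri: str  # entity description in the knowledge base

ann = SemanticAnnotation(
    doc_uri="http://example.org/docs/42",
    start=117, end=125,
    class_uri="http://example.org/onto#City",      # illustrative ontology URI
    instance_uri="http://example.org/kb#NewYork",  # illustrative KB URI
)
```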
2.1 Light-weight Upper-Level Ontology

We will shortly advocate the appropriateness of using an ontology for defining the entity types – ontologies are the only widely accepted paradigm for the management of open, sharable, and reusable knowledge. In our view, a light-weight ontology (poor on axioms) is sufficient for a simple definition of the entity classes and their appropriate attributes and relations. At the same time, it allows more efficient and scalable management of the knowledge (compared to the heavy-weight semantic approaches). An ontology to support semantic annotation in a web context should address a number of general classes which tend to appear in texts across various domains. Describing these classes together with the most basic relations and attributes means that an upper-level ontology should be involved. The experience within a number of projects (for instance, Cyc and the Standard Upper Ontology initiative) demonstrates that "logically extensive" upper-level ontologies are extremely hard to agree on, build, maintain, understand, and use. This seems to provide enough evidence that a light-weight upper-level ontology is necessary for semantic annotations.

2.2 Knowledge Representation Language

According to the analysis of ontology and knowledge representation languages and formats in [11] and by other authors, it becomes evident that there is not much consensus beyond RDF(S), see [4]. The latter is well established in the Semantic Web community as a knowledge representation and interchange language. The rich diversity of RDF(S) repositories, APIs and tools forms a mature environment for the development of systems grounded in an RDF(S) representation of their ontological and knowledge resources. Because of the common acceptance of RDF(S) in the Semantic Web community, it would be easy to reuse the ontology and KB, as well as to enrich them with domain-specific extensions. The new OWL standard (see [9]) offers a clear, relatively consensual and backward-compatible path beyond RDF(S), but still lacks sufficient tool support. Our experience shows (see the section on KIM) that for the basic purposes of light-weight ontology definition and entity description, RDF(S) provides sufficient basic expressiveness. The most critical nice-to-have primitives (equality, transitive and symmetric relations, etc.) are well covered in OWL Lite – the simplest first level of OWL. So, we suggest that RDF(S) is used in a way which allows easy extension towards OWL – this means avoiding primitives and patterns not included in OWL (/2002/07/owl).

2.3 Metadata Encoding and Management

The metadata has to be stored in a format allowing its efficient management; we are not going to prescribe a specific format here, but rather to outline a number of principles and requirements for document and annotation management:
- documents (and other content) in different formats should be identifiable and their text content accessible;
- it should be possible to store, manage and retrieve non-embedded annotations over documents according to their positions, features, and references to a KB;
- embedding of the annotations should be allowed, at least for some of the formats;
- export and exchange of the annotations in different formats should be allowed.

There are a number of standards and initiatives related to the encoding and representation of metadata for text. Two of the most popular are TEI (the Text Encoding Initiative) and Tipster (Tipster Architecture, /cs/faculty/grishman/tipster.html).

2.4 Knowledge Base

Once the entity types, relations, and attributes are encoded in an ontology, the next aspect of the semantic annotation representation is the entity descriptions. It should be possible to identify, describe and interconnect the entities in a general, flexible and standard fashion.
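A minimal sketch of such an entity description, kept in RDF(S) so that it stays OWL-compatible as suggested above; the namespaces and property names are illustrative stand-ins, not KIM's actual vocabulary:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

ONTO = Namespace("http://example.org/onto#")  # light-weight upper-level ontology (illustrative)
KB = Namespace("http://example.org/kb#")      # entity descriptions (illustrative)

g = Graph()
# class definitions from the upper-level ontology
g.add((ONTO.Location, RDF.type, RDFS.Class))
g.add((ONTO.City, RDFS.subClassOf, ONTO.Location))

# an identifiable, interconnected entity description
g.add((KB.NewYork, RDF.type, ONTO.City))
g.add((KB.NewYork, RDFS.label, Literal("New York")))
g.add((KB.NewYork, ONTO.hasAlias, Literal("NY")))    # aliases later support recognizing "NY", "N.Y."
g.add((KB.NewYork, ONTO.hasAlias, Literal("N.Y.")))
g.add((KB.NewYork, ONTO.locatedIn, KB.USA))          # relation to another entity
```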
We call a body of formal knowledge about entities a knowledge base (KB) – although a bit old-fashioned, this term best reflects the representation of non-ontological formal knowledge. A KB is expected to contain mostly instance knowledge/data, so other names could also fit such a dataset well.

We consider the ontology (defining all classes, relations and attributes, together with further constraints and dependencies) a sort of schema for the KB, and both should be kept in a semantic store – any sort of formal knowledge reasoning and management system which provides the basic operations: storage and retrieval according to the syntax and semantics of the selected formalism. The store may or may not provide inference (for instance, there are experts who do not consider the interpretation of RDF(S) according to its model-theoretic semantics to be inference, just because it is simple compared to the semantics and inference methods of other languages); it can implement different reasoning strategies, etc. There are also more advanced management features which are not considered a must: versioning, access control, transaction support, locking, client caching. For an overview of those, see [16], [15], [19] and [25]. Whether the ontology and the knowledge base should be kept together is a matter of distributed knowledge representation and management, which is outside the scope of this paper.

The KB can host two sorts of entity knowledge (descriptions and relationships):
- pre-populated – knowledge imported or otherwise acquired from trusted sources;
- automatically extracted – knowledge discovered in the process of semantic annotation (say, via IE) or through other knowledge discovery and acquisition methods such as data mining.

It is up to the specific implementation whether or not, and how much, the KB is pre-populated. For instance, information about entities of general importance (including their aliases) can significantly help the IE used for automatic semantic annotation – an extensive proposal about this can be found in the description of the KIM platform later in this paper.

Further, domain- and task-specific knowledge can help the customization of a semantic annotation application – after extending the ontology to match the application domain, the KB can be pre-populated with specific entities. For instance, information about specific markets, customers, products, technologies and competitors could be of great help for business intelligence and press-clipping; for company intelligence within the UK it would be important to have a more exhaustive coverage of UK-based companies and UK locations. It might also appear beneficial to reduce the general information that is not applicable in the concrete context and thus construct a more focused KB.

Since state-of-the-art IE (and in particular named entity recognition, NER) allows recognition of new (previously unknown) entities and of relations between them, it is reasonable to use this advantage for the enrichment of the KB. Because of the innate imprecision of these methods, the knowledge accumulated through them should be kept distinguishable from the pre-populated knowledge. Thus the extraction of new metadata can still be grounded in the trusted knowledge about the world, while the accumulated entities remain available for indexing, browsing and navigation.
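One simple way to keep the two sorts of entity knowledge distinguishable is to store them in separate named graphs of the semantic store. A sketch, assuming rdflib's Dataset; KIM's actual partitioning mechanism may differ:

```python
from rdflib import Dataset, Namespace
from rdflib.namespace import RDF

ONTO = Namespace("http://example.org/onto#")  # illustrative URIs
KB = Namespace("http://example.org/kb#")

ds = Dataset()
trusted = ds.graph(KB["graph/trusted"])      # pre-populated, from trusted sources
extracted = ds.graph(KB["graph/extracted"])  # accumulated by IE, pending validation

trusted.add((KB.FAO, RDF.type, ONTO.Organization))
extracted.add((KB.SomeNewCompany, RDF.type, ONTO.Organization))  # newly recognized entity

def promote(entity):
    """Semi-automatic validation: move a recognized entity's statements
    into the trusted graph."""
    for triple in list(extracted.triples((entity, None, None))):
        extracted.remove(triple)
        trusted.add(triple)
```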
Recognized entities could thus be transformed into trusted ones at some point, through a semi-automatic validation process. An important part of this enrichment would be the template extraction of entity relations, which can be seen as a kind of content-based learning by the system. Depending on the texts being processed, the respective changes would occur in the recognized parts of the KB, and thus its projection of the world would change accordingly (e.g. when processing only sports news articles, the metadata would become rich for this domain and stay poor for the others).

2.5 Unified Representation of Lexical Knowledge

Symbolic IE processing usually requires some lexica to be used for pattern recognition and other purposes. These are both general entries (such as various sorts of stop words) and entries specific to the entity classes being handled. It is common for IE systems to keep these in application-specific formats or directly hard-coded in the source code. It is worth representing and managing those in the same format used for the ontology and the entity knowledge base – this way the same tools (parsers, editors, etc.) can be used to manage both sorts of knowledge. For this purpose, a part of the ontology (or just a separate one) could be dedicated to defining the types of lexical resources used by the natural language technologies involved.

The corresponding lexical-resource part of the KB should be pre-populated to aid the IE process by providing clues for entity and relation recognition which go beyond the already known instances. For instance, for efficient recognition of persons in the text one would need lists of first names (male and female), person titles, positions and professions. Some of these could be made ontologically distinguishable by gender as well. For the Organization lexica, one should pre-populate possible suffixes (such as Ltd., GmbH, etc.) and terms appearing in organization names (e.g. company, theatre, etc.). Additionally, time and date lexica ("a.m.", "Tue", etc.), currency units, address lexica and others should be included. The mature symbolic NER and IE systems already have coverage of such resources; the next step, to integrate them in a system for automatic semantic annotation, would be just to encode them in a formal ontology and present them in the KB.
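A minimal sketch of encoding such lexical resources alongside the ontology, so that gazetteer lists become typed instances rather than flat files; the lexicon classes and URIs are illustrative assumptions, not an actual KIM vocabulary:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

LEX = Namespace("http://example.org/lexicon#")  # illustrative lexical-resource ontology

g = Graph()
# lexical-resource types, defined once in the ontology
for cls in (LEX.PersonFirstName, LEX.PersonTitle, LEX.OrgSuffix):
    g.add((cls, RDFS.subClassOf, LEX.LexicalResource))

# pre-populated lexicon entries, now sharable and editable with the same RDF tools
for name in ("John", "Maria"):
    entry = LEX["firstName/" + name]
    g.add((entry, RDF.type, LEX.PersonFirstName))
    g.add((entry, LEX.hasLexicalization, Literal(name)))

for suffix in ("Ltd.", "GmbH"):
    entry = LEX["orgSuffix/" + suffix.strip(".")]
    g.add((entry, RDF.type, LEX.OrgSuffix))
    g.add((entry, LEX.hasLexicalization, Literal(suffix)))
```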
3 Semantic Annotation Process

As already mentioned, we focus mainly on automatic semantic annotation, leaving manual annotation to approaches more related to authoring web content. Even if less accurate, the automatic approaches to metadata acquisition promise scalability, and without them the Semantic Web would remain mostly a vision for a long time. Our experience shows that the existing state-of-the-art IE systems have the potential to automate the annotation with reasonable accuracy and performance. Although a lot of research and development has been contributed to the area of automatic IE so far, the lack of standards and of integration with formal knowledge management systems has been obscuring its usage. We claim that it is crucial to encode the extracted knowledge formally and according to well-known and widely accepted knowledge representation and metadata encoding standards. Such a system should be easily extensible for domain-specific applications, providing basic means for addressing the most common entity types, their attributes, and relations.

3.1 Extraction

It is a major problem of the traditional NER approaches that the annotations produced are not encoded in an open formal system and that unbound entity types are used. The resources used are also traditionally represented in proprietary forms with no clear semantics. This hinders the reuse of both the lexical resources and the resulting annotations by other systems, thus limiting the progress of the language technologies, since effortless sharing of resources and results is too expensive.

These problems can be partly resolved by an ontology-based infrastructure for IE. As proposed above, the entity types should be defined within an ontology, and the entities being recognized should be described (or at least kept) in an accompanying KB. Thus NLP systems with ontology support would more easily share both pre-populated knowledge and the results of their processing, as well as all the different sorts of lexicons and other resources commonly used.

An important case demonstrating how ontologies can be used in IE are the so-called gazetteers, used to look up in the text predefined strings out of predefined lists. At present, the lists are kept in proprietary formats, and a typical result of the gazetteers' work are annotations with some unbound strings used as types. A better approach presumes that all the various annotation types and list values are kept in a semantic store. Thus the resulting annotation can be typed by reference to ontology classes and, even further, point to a specific lexeme or entity, if appropriate. (A sketch of such a KB-backed lookup follows this section.)

Since a huge amount of NLP research has been contributed in recent years (and even decades), we suggest the reuse of existing systems with proven maturity and effectiveness. Such a system should be modified so as to use resources kept in a KB and to produce annotations referring to the latter. Our experience shows that such a change is not a trivial one: all the processing layers have to be re-engineered in order to be opened towards the semantic repository and to depend on it for their inputs. However, there are a number of benefits to such an approach:
- All the various sorts of resources can be managed in a much more standard and uniform way;
- It becomes easier to manage the different sorts of linguistic knowledge at the proper level of generality. For instance, a properly structured entity type hierarchy allows the entities and their references in the text to be classified in the most precise way, but still easily matched in more general patterns. Thus, one can have a specific mountain annotated and still match it within a grammar rule which expects any sort of location;
- Wherever possible, any available further knowledge is accessible directly via a reference from the annotation to the semantic store. Thus, available knowledge about an entity can be used, for instance, for disambiguation or co-reference resolution tasks.

A processing layer that is not inherent to the traditional IE systems can generate and store in the KB the descriptions of the newly discovered entities. When the same entity is encountered in the text the next time, it can be directly linked to the already generated description. Further, extending the IE task to cover template relation extraction, another layer can enrich the KB with these relations.
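A minimal sketch of a KB-backed gazetteer lookup along these lines – matching aliases drawn from the semantic store and emitting annotations typed by ontology class and linked to the KB instance. This is a toy illustration, not KIM's actual GATE-based pipeline:

```python
def gazetteer_annotate(text, alias_index):
    """alias_index: alias string -> (class_uri, instance_uri), built from the KB.
    Returns standoff annotations as (start, end, class_uri, instance_uri)."""
    annotations = []
    for alias, (class_uri, instance_uri) in alias_index.items():
        start = text.find(alias)
        while start != -1:
            annotations.append((start, start + len(alias), class_uri, instance_uri))
            start = text.find(alias, start + 1)
    return annotations

alias_index = {
    "New York": ("http://example.org/onto#City", "http://example.org/kb#NewYork"),
    "N.Y.":     ("http://example.org/onto#City", "http://example.org/kb#NewYork"),
}
anns = gazetteer_annotate("The N.Y. offices reopened on Monday.", alias_index)
# both surface forms resolve to the same instance URI,
# which is what enables entity-level indexing (see 3.2)
```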
3.2 Indexing and Retrieval

Historically, the issue of specific handling of the named entities was neglected by the information retrieval (IR) community, apart from some shallow handling for the purpose of question/answering tasks. However, a recent large-scale human interaction study on a personal content IR system of Microsoft (reported in [10]) demonstrates that, at least in some cases, ignoring the named entities does not match the user needs: "The most common query types in our logs were People/places/things, Computers/internet and Health/science. In the People/places/things category, names were especially prevalent. Their importance is highlighted by the fact that 25% of the queries involved people's names suggesting that people are a powerful memory cue for personal content. In contrast, general informational queries are less prevalent."

As the web content is rapidly growing, the demand for more advanced retrieval methods increases accordingly. Based on semantic annotations, efficient indexing and retrieval techniques can be developed involving explicit handling of the named entity references. In a nutshell, the semantic annotations can be used to index both "NY" and "N.Y." as occurrences of the specific entity "New York", as if there were just its unique ID. Because no entity recognition is involved, the present systems will instead index on "NY", "N", and "Y", which demonstrates well some of the problems with the keyword-based search engines.
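A minimal sketch of such entity-aware indexing – treating the instance URI as the index token, so that all aliases of an entity hit the same posting list. A toy inverted index, not a production IR engine:

```python
from collections import defaultdict

entity_index = defaultdict(set)  # instance URI -> set of document ids
NEW_YORK = "http://example.org/kb#NewYork"  # illustrative KB URI

def index_document(doc_id, annotations):
    """annotations: standoff records carrying instance URIs (see 3.1)."""
    for _start, _end, _class_uri, instance_uri in annotations:
        entity_index[instance_uri].add(doc_id)

# "NY", "N.Y." and "New York" all produce annotations pointing to the same URI,
# so a query about the entity retrieves every document mentioning any alias:
index_document("doc1", [(0, 2, "http://example.org/onto#City", NEW_YORK)])
index_document("doc2", [(4, 8, "http://example.org/onto#City", NEW_YORK)])
docs = entity_index[NEW_YORK]  # {'doc1', 'doc2'}
```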
The latest is (to the best of our knowledge) an unique KIM feature which allows further indexing and retrieval of documents with respect to entities.For the end-user, the usage of a KIM-based application is straightforward and simple – requesting annotation from a browser plug-in, which highlights the entities in the current content and generates a hyperlink used for further exploring the available knowledge for the entity (as shown in Fig. 4). A semantic query web UI allows specification of a search query, that consists of entity type, name, attribute and relation restrictions (allowing queries such as Organization-locatedIn-Country, Person-hasPosition-Position-within-Organization, etc.)This section provides a short overview of the main components of KIM, which is presented in bigger details in [27] and on its web site, /kim.。
