distributed information retrieval


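3. Liquid Chemical Spill Drill Plan

Appendix: Reference drill plan

1. Objective

This spill drill plan offers a reference procedure for controlling and handling spills of liquid chemicals and materials, including on-site control of the spill or release source and remedial measures. Construction teams can use it to check the completeness and reliability of their own spill control and emergency response procedures and equipment.

The objective of the "spill drill scenario" is to test the preparedness and the integrity of the spill team's procedures and response, which includes stopping the source of the release, containing the release and commencing the remedial actions.

2. Definitions

MSDS (Material Safety Data Sheet): A document containing information about a material or chemical, including the chemical and generic names of its ingredients, the chemical and physical properties of the substance, health hazard information, and precautions for safe use and handling. Chemical manufacturers and importers use it to state a chemical's physical and chemical properties (such as flash point, flammability, and reactivity) and the hazards it may pose to users' health (such as carcinogenicity or teratogenicity).

Spill kit: Absorbent material, most commonly oil-absorbent pads, made from oleophilic ultra-fine-fibre non-woven fabric. It contains no chemical agents, causes no secondary pollution, and can quickly absorb tens of times its own weight in oil, organic solvents, hydrocarbons, vegetable oil, and similar liquids. It is used to absorb spilled liquid and limit further spread when a liquid spill or leak occurs.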

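Information Retrieval Course Paper

Title: Research on Network-Based Information Retrieval Applications
School: School of Computer Science and Engineering
Major: Software Engineering
Student:
Student ID:
Instructor:

Research on Network-Based Information Retrieval Applications
Wang Yangbo (王扬波) (Electronics and Communication Engineering, University School of Computer Science)

Abstract: Network information retrieval generally refers to Internet retrieval: through network interface software, a user at a single terminal can query information resources that are online anywhere.

Systems of this kind are developed and deployed around the distributed nature of the Internet. Data is stored in a distributed way, so large volumes of data can be spread across different servers. Users retrieve in a distributed way, so end users anywhere can access the stored data. Data is processed in a distributed way, so any data can be processed anywhere on the network.

This paper studies network-based information retrieval applications and analyzes their limitations.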
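The distributed storage and distributed retrieval described in the abstract can be sketched in a few lines of Python. This is only an illustrative toy, not a system from the paper: the `SERVERS` dictionary, the term-overlap scoring, and the function names `search_server` and `distributed_search` are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "servers": each one stores its own share of the documents.
SERVERS = {
    "server_a": {
        "a1": "distributed information retrieval on the internet",
        "a2": "network interface software for terminals",
    },
    "server_b": {
        "b1": "information retrieval evaluation",
        "b2": "distributed storage of large data",
    },
}


def search_server(name, query):
    """Score each document on one server by how many query terms it contains."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in SERVERS[name].items():
        words = text.lower().split()
        score = sum(1 for term in terms if term in words)
        if score:
            hits.append((score, doc_id, name))
    return hits


def distributed_search(query, top_k=3):
    """Send the query to every server in parallel and merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lambda name: search_server(name, query), SERVERS))
    merged = [hit for hits in partial for hit in hits]
    # Highest score first; ties are broken by document id for a stable order.
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return merged[:top_k]


if __name__ == "__main__":
    for score, doc_id, server in distributed_search("distributed retrieval"):
        print(score, doc_id, server)
```

A real system would replace the dictionary lookups with network calls and would need a more careful way to make scores from different servers comparable; the sketch only shows the fan-out and merge steps.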

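Keywords: information retrieval; network; distributed

Research on the Application of Information Retrieval Based on Network
XX (xx)

Abstract: Network information retrieval generally refers to Internet search: through network interface software, users can query information resources on the Internet from a single terminal. This kind of retrieval system is built on the distributed nature of the Internet. That is, data can be distributed and stored on different servers, users anywhere can access the stored data, and data can be processed anywhere on the Internet. In this paper, we study the application of information retrieval based on network, and analyze the development trend.

Key words: information retrieval; network; distributed

1 Introduction to network information retrieval

With the rapid development of information technology, information has become an important resource for society as a whole. How much information a country or region holds and how advanced its information processing is have become important indicators of its degree of modernization, and the wealth of information on the network has changed the way people work and live to an even greater degree.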

Information Retrieval Keywords

Chapter 1
Information Retrieval, IR (信息检索)
data retrieval (数据检索)
relevance (相关性)
push (推送)
hyperspace (超空间)
pulling (拉出)
logical view of the document (文献逻辑表示/视图)
retrieval task (检索任务)
retrieval (检索)
filtering (过滤)
full text (全文本)
stemming (词干提取)
text operation (文本操作)
indexing term (标引词)
retrieval strategy (信息检索策略)
Optical Character Recognition, OCR (光学字符识别)
cross-language (跨语言)
inverted file (倒排文档)
retrieved document (检出文献)
likelihood (相关度)
human-computer interaction, HCI (信息检索的人机交互界面)
Retrieval Model & Evaluation (检索模型与评价)
textual images (文本图像)
Interface & Visualization (界面与可视化)
bibliographic system (书目系统)
Multimedia Modeling & Searching (多媒体建模与检索)
Digital Library (数字图书馆)
retrieval evaluation (检索评价)
Standard Generalized Markup Language, SGML (标准通用标记语言)
indexing and searching (标引和检索)
Navigation (导航)
parallel and distributed IR (并行和分布式信息检索)
model and query language (模型与查询语言)
efficient indexing and searching (有效标引与检索)

Chapter 2
ad hoc retrieval (特别检索)
filtering (过滤)
set theoretic (集合论)
algebraic (代数)
probabilistic (概率)
routing (路由选择)
user profile (用户需求档)
threshold (阈值)
weight (权值)
term-weighting (语词加权)
similarity (相似度)
dissimilarity (相异度)
domain modeling (域建模)
thesaurus (叙词表)
flat (扁平)
generalized vector space model (广义向量空间模型)
neuron (神经元)
latent semantic indexing model (潜语义标引模型)
proximal node (邻近结点)
Bayesian belief network (贝叶斯信任度网络)
structure guided (结构导向)
structured text retrieval, STR (结构化文本检索)
inference network (推理网络)
extended Boolean model (扩展布尔模型)
non-overlapping list (非重叠链表)

Chapter 3
retrieval performance evaluation (检索性能评价)
interactive session (会话)
Recall Ratio, R (查全率)
informativeness (信息性)
Precision Ratio, P (查准率)
user-oriented (面向用户)
Omission Ratio, O (漏检率)
novelty ratio (新颖率)
Miss Ratio, M (误检率)
user effort (用户负担)
relative recall (相对查全率)
coverage ratio (覆盖率)
reference test collection (参考测试集)
goodness (优劣程度)
recall effort (查全率负担)
subjectiveness (主观性)
informativeness measure (信息性测度)

Chapter 4
retrieval unit (检索单元)
alphabet (字母表)
separator (分隔符)
compositional (复合性)
fuzzy Boolean (模糊布尔)
pattern (模式)
Structured Query Language, SQL (结构化查询语言)
Boolean query (布尔查询)
reference (参照)
semijoin (半结合)
tag (标签)
ordered inclusion (有序包含)
unordered inclusion (无序包含)
Common Command Language, CCL (通用命令语言)
tree inclusion (树包含)
Boolean operator (布尔运算符)
searching allowing errors (容错查询)
Structured Full-text Query Language, SFQL (结构化全文查询语言)
relevance feedback (相关反馈)
extended patterns (扩展模式)
Compact Disk Read only Data exchange, CD-RDx (只读磁盘数据交换)
Wide Area Information Service, WAIS (广域信息服务系统)
visual query languages (查询语言的可视化)
query syntax tree (查询语法树)

Chapter 5
query reformulation (查询重构)
query expansion (查询扩展)
term reweighting (语词重新加权)
similarity thesaurus (相似性叙词表)
User Relevance Feedback (用户相关反馈)
graphical interfaces (图形化界面)
cluster (簇)
searchonym (检索同义词)
local context analysis (局部上下文分析)

Chapter 6
document (文献)
style (样式)
metadata (元数据)
Descriptive Metadata (描述性元数据)
Semantic Metadata (语义元数据)
intellectual property rights (知识产权)
content rating (内容等级)
digital signatures (数字签名)
privacy levels (权限)
electronic commerce (电子商务)
Dublin Core Metadata Element Set (都柏林核心元数据集)
Standard Generalized Markup Language, SGML (通用标记语言)
Machine Readable Cataloging Record, MARC (机读目录记录)
Resource Description Framework, RDF (资源描述框架)
eXtensible Markup Language, XML (可扩展标记语言)
HyperText Markup Language, HTML (超文本标记语言)
Tagged Image File Format, TIFF (标签图像文件格式)
Joint Photographic Experts Group, JPEG
Portable Network Graphics, PNG (新型位图图像格式)

Chapter 7
separator (分隔符)
hyphen (连字符)
list of stopwords (排除表)
stemming (词干提取)
Porter (波特)
treasury of words (词库)
controlled vocabulary (受控词汇表)
indexing component (索引单元)
text compression (文本压缩)
compression algorithm (压缩算法)
explanation (注释)
statistical method (统计方法)
Huffman (赫夫曼)
compression ratio (压缩比)
encryption (数据加密)
semi-static (半静态的)
lexical analysis (词汇分析)
elimination of stopwords (排除停用词)

Chapter 8
semi-static (半静态)
vocabulary (词汇表)
occurrence (事件表)
inverted files (倒排文档)
suffix arrays (后缀数组)
signature files (签名档)
block addressing (块寻址)
index point (索引点)
beginning (起始位置)
vocabulary search (词汇表检索)
retrieval of occurrences (事件表检索)
manipulation of occurrences (事件表操作)
hashing (散列变换)
false drop (误检)
query syntax tree (查询语法树)
Brute-Force algorithm, BF (布鲁特-福斯算法)
failure (故障)
shift-or (移位-或)
bit-parallelism (位并行处理)
sequential search (顺序检索)
in-place (原位)

Chapter 9
parallel computing (并行计算)
SISD (单指令流单数据流)
SIMD (单指令流多数据流)
MISD (多指令流单数据流)
MIMD (多指令流多数据流)
distributed computing (分布计算)
granularity (颗粒度)
multitasking (多任务)
I/O (input/output)
indexer (标引器)
map (映射)
hit-list (命中列表)
global term statistics (全局语词统计值)
thread (线程)
arithmetic logic unit, ALU (算术逻辑单元)
broker (中介器)
virtual processor (虚拟处理器)
distributed information retrieval (分布式信息检索)
gatherer (文献收集器)
central broker (主中介器)

Chapter 10
information visualization (信息可视化)
icon (图标)
color highlighting (颜色凸出显示)
focus-plus-context (焦点+背景)
brushing and linking (画笔和链接)
magic lenses (魔术透镜)
panning and zooming (移动镜头和调焦)
elastic window (弹性窗口)
overview plus details (概述及细节信息)
highlight (高亮色显示)
information access tasks (信息存取任务)
document surrogate (文献替代)
Frequently Asked Question, FAQ (常见问题)
social recommendation (群体性推荐)
keyword-in-context, KWIC (上下文关键词)
pseudo-relevance feedback (伪相关反馈)
overlapping window (重叠式窗口)
working set (工作集)

Chapters 11/12
Multimedia Information Retrieval, MIR (多媒体信息检索)
superclass (超类)
semi-structured data (半结构化数据)
data blade (数据片)
extensible type system (可扩充型系统)
intersect (相交)
dynamic server (动态服务器)
overlaps (叠加)
archive server (档案库服务器)
center (聚集)
logical structure (逻辑结构)
contain word (词包含)
query by example (例子中的查询)
path-name (路径名)
Query by Image Content, QBIC (通过图像内容查询)
image header (图像标题)
Principal Component Analysis, PCA (主要成分分析)
exact match (精确匹配)
Latent Semantic Indexing, LSI (潜语义标引)
content-based (基于内容)
Range Query (范围查寻)

Chapter 13
exponential growth (指数增长)
distributed data (数据的分布性)
volatile data (不稳定数据)
redundant data (冗余数据)
heterogeneous data (异构数据)
cut point (分界点)
Centralized Architecture (集中式结构)
crawler-indexer (收集器-标引器)
Wanderers (漫步者)
Walkers (步行者)
Knowbots (知识机器人)
Distributed Architecture (分布式结构)
gatherers (收集器)
brokers (中介器)
query interface (查询界面)
answer interface (响应界面)
PageRank (网页级别)
Crawling the Web (漫游Web)
breadth-first (广度优先)
depth-first (深度优先)
Indices, plural of index (索引)
Web Directories (网络目录)
Metasearchers (元搜索引擎)
Teaching the User (用户培训)
granularity (颗粒度)
Hypertext Induced Topic Search, HITS (超文本推导主题检索)
specific queries (专指性查询)
broad queries (泛指性查询)
vague queries (模糊查询)
Searching using Hyperlinks (使用超链接搜索)
Web Query Languages (查询语言)
Dynamic Search (动态搜索)
Software Agents (软件代理)
fish search (鱼式搜索)
shark search (鲨鱼搜索)
pull/push (拉出/推送)
portal (门户)
duplicated data (重复数据)

Chapter 14
online public access catalog, OPAC (联机公共检索目录)
Chemical Abstracts, CA (化学文摘)
Biological Abstracts, BA (生物学文摘)
Engineering Index, EI (工程索引)
Library of Congress Classification (国会图书馆分类法)
Dewey Decimal Classification (杜威十进分类法)
Online Computer Library Center, OCLC (联机计算机图书馆中心)
Machine Readable Cataloging Record, MARC (机读目录记录)

Chapter 15
National Science Foundation, NSF (美国国家科学基金会)
National Aeronautics and Space Administration, NASA (美国航空航天局)
Digital Libraries Initiative, DLI (数字图书馆创新项目)
5S: stream (信息流), structure (结构), space (空间), scenario (场景), society (社会)
Digital Object Identifier, DOI (数字化对象标识符)
Dublin Core, DC (都柏林核心)
Digital Library, DL (数字图书馆)
Resource Description Framework, RDF (资源描述框架)
Text Encoding Initiative, TEI (文本编码创新项目)

Distributed Databases and Applications (in English)

Distributed Databases and Applications
John Wieczorek
Museum of Vertebrate Zoology, UC Berkeley
DiGIR
Distributed Databases

Multiple sources of data …under local control, …with concepts in common …and a desire to deliver data as part of a community.
Project Goals

To define a protocol for retrieving structured data from multiple, heterogeneous databases across the Internet
To build a reference implementation of both provider and portal software using said protocol
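The provider and portal roles named in these goals can be illustrated with a small Python sketch. It is not the DiGIR protocol or its reference implementation: the `PROVIDERS` table, the shared concept names `ScientificName` and `Country`, and the functions `query_provider` and `portal_query` are hypothetical, chosen only to show how locally controlled databases with different field names can answer one query through concepts they hold in common.

```python
# Hypothetical providers: each keeps its own local field names ("local control")
# plus a mapping from the shared concepts to those local fields.
PROVIDERS = {
    "museum_a": {
        "mapping": {"ScientificName": "sci_name", "Country": "country"},
        "records": [
            {"sci_name": "Puma concolor", "country": "Mexico"},
            {"sci_name": "Lynx rufus", "country": "USA"},
        ],
    },
    "museum_b": {
        "mapping": {"ScientificName": "taxon", "Country": "nation"},
        "records": [
            {"taxon": "Puma concolor", "nation": "Peru"},
        ],
    },
}


def query_provider(name, concept, value):
    """Translate a shared concept to the provider's local field and filter records."""
    provider = PROVIDERS[name]
    local_field = provider["mapping"][concept]
    reverse = {local: shared for shared, local in provider["mapping"].items()}
    results = []
    for record in provider["records"]:
        if record.get(local_field) == value:
            # Re-express the matching record in the shared vocabulary.
            shared = {reverse[key]: val for key, val in record.items() if key in reverse}
            shared["provider"] = name
            results.append(shared)
    return results


def portal_query(concept, value):
    """Ask every provider the same question and concatenate the answers."""
    results = []
    for name in PROVIDERS:
        results.extend(query_provider(name, concept, value))
    return results


if __name__ == "__main__":
    for row in portal_query("ScientificName", "Puma concolor"):
        print(row)
```

The design point the sketch tries to capture is that the portal never sees local field names: each provider owns its mapping, so a database can change its schema without the portal changing.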

LifeMapper
Global Biodiversity Information Facility (GBIF)
Distributed vs. centralized

Multiple sources of data …under local control, …with concepts in common …and a desire to deliver data as part of a community

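School of Computer Science and Technology: Rules on Publishing Academic Papers for Doctoral Degree Applications (posted online September 2008)

In accordance with the Huazhong University of Science and Technology Rules on Publishing Academic Papers for Doctoral Degree Applications (《华中科技大学申请博士学位发表学术论文的规定》), doctoral students of this school must, before applying for the doctoral degree, publish academic papers that meet one of the following requirements:

1. One paper in a Class A or Class B journal, or at a top international academic conference designated by the school;
2. One SCI journal paper, one Class C paper, and one paper in a leading domestic journal;
3. One SCI journal paper and two papers in leading domestic journals;
4. One SCI journal paper and two Class C papers.

Class A, B, and C journals are those listed for computer science and technology and other related disciplines in the Huazhong University of Science and Technology Journal Classification Measures (《华中科技大学期刊分类办法》). Class C includes international conference papers indexed by EI.

The leading domestic journals designated by the school are 中国科学 (Science China), 科学通报 (Chinese Science Bulletin), Journal of Computer Science and Technology, 计算机学报 (Chinese Journal of Computers), 软件学报 (Journal of Software), 计算机研究与发展 (Journal of Computer Research and Development), Frontiers of Computer Science in China, 电子学报 (Acta Electronica Sinica), 自动化学报 (Acta Automatica Sinica), 通信学报 (Journal on Communications), 数学学报 (Acta Mathematica Sinica), 应用数学学报 (Acta Mathematicae Applicatae Sinica), 计算机辅助设计与图形学学报 (Journal of Computer-Aided Design & Computer Graphics), and the journals of first-level academic societies in other related disciplines.

Of the papers the applicant has published or had accepted for publication, at least one must be a full paper in a foreign language published in a Class C or higher venue.

Papers that the applicant has published or had accepted must form an important part of the dissertation and must be research results completed independently by the applicant under the supervisor's guidance, with Huazhong University of Science and Technology as the first affiliation and the applicant as first author (for papers co-authored with the supervisor where the supervisor is first author, the applicant may be second author).

The ranking of "equal contribution authors" is determined in accordance with the Huazhong University of Science and Technology Journal Classification Measures (校人[2008]28号文).

These rules apply to doctoral students enrolled from 2008 onward.

The Degree Review Committee of the School of Computer Science and Technology holds the right to interpret and amend these rules.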

Degree Review Committee, School of Computer Science and Technology, Huazhong University of Science and Technology
1 September 2008

To improve the quality of graduate training, raise academic standards, and promote international academic exchange, the Degree Review Committee of the School of Computer Science has decided to divide top international academic conferences into Classes A and B, as follows:

I. Class A
1. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
2. ACM Conference on Computer and Communication Security (CCS)
3. USENIX Conference on File and Storage Techniques (FAST)
4. International Symposium on High Performance Computer Architecture (HPCA)
5. International Conference on Software Engineering (ICSE)
6. International Symposium on Computer Architecture (ISCA)
7. USENIX Conference on Operating System and Design (OSDI)
8. ACM SIGCOMM Conference (SIGCOMM)
9. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)
10. International Conference on Management of Data and Symposium on Principles of Database Systems (SIGMOD/PODS)
11. ACM Symposium on Operating Systems Principles (SOSP)
12. Annual ACM Symposium on Theory of Computing (STOC)
13. USENIX Annual Technical Conference (USENIX)
14. ACM International Conference on Virtual Execution Environments (VEE)
15. International Conference on Very Large Data Bases (VLDB)

II. Class B
1. International Conference on Dependable Systems and Networks (DSN)
2. IEEE Symposium on Foundations of Computer Science (FOCS)
3. IEEE International Symposium on High Performance Distributed Computing (HPDC)
4. International Conference on Distributed Computing Systems (ICDCS)
5. International Conference on Data Engineering (ICDE)
6. IEEE International Conference on Network Protocols (ICNP)
7. ACM International Conference on Supercomputing (ICS)
8. International Joint Conference on Artificial Intelligence (IJCAI)
9. IEEE Conference on Computer Communications (INFOCOM)
10. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
11. Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
12. ACM/IFIP/USENIX International Middleware Conference (Middleware)
13. ACM International Conference on Multimedia (MM)
14. ACM International Conference on Mobile Systems, Applications, and Services (MobiSys)
15. ACM Conference on Programming Language Design and Implementation (PLDI)
16. Annual ACM Symposium on Principles of Distributed Computing (PODC)
17. ACM Symposium on Principles of Programming Languages (POPL)
18. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)
19. IEEE Real-Time Systems Symposium (RTSS)
20. Supercomputing (SC'XY) Conference
21. ACM Conference on Computer Graphics and Interactive Techniques (SIGGRAPH)
22. ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS)
23. IEEE Symposium on Security and Privacy (SP)
24. Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA)
25. International World Wide Web Conference (WWW)

Degree Review Committee, School of Computer Science and Technology, Huazhong University of Science and Technology
17 November 2008

Trial Measures of the School of Computer Science for Funding Faculty and Students to Attend Top International Academic Conferences
院办字[2006]06号

To promote international academic exchange among the school's faculty and students and to raise academic standards, and following deliberation at the 75th school office meeting and consultation with the 2nd Professors' Advisory Committee, the school funds faculty and enrolled students to attend top international academic conferences and has drawn up these measures.

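Literature Retrieval Exercise Book

Chapter 1: Single-choice questions

1. According to the relevant national standard, a document (文献) is defined as "any carrier on which (knowledge) is recorded."

2. Papers, reports, and similar works that are created on the basis of the author's own results and are publicly published are called (primary literature).

3. Abstracts, bibliographic entries, catalogs, and the like are (secondary literature).

4. Manuscripts, private notes, and the like are (zero-order) literature; dictionaries, handbooks, and the like are (tertiary) literature.

5. In order of publication time, the levels of literature should be arranged as (primary literature, secondary literature, tertiary literature).

6. The main function of (secondary literature) is to retrieve, report, and control primary literature, helping people obtain more literature information in less time.

7. Primary, secondary, and tertiary literature are distinguished by (depth of processing).

8. By (carrier type), literature can be divided into print, microform, and so on.

9. A serial publication with a fixed title, a uniform format, and a regular publication pattern, issued periodically or irregularly, is called a (journal).

10. Among specialized literature, (journals) have the shortest publication cycle, the largest circulation, and the fastest reporting, and are the channel through which most papers are published.

11. Among public publications, current (newspaper literature) is likely to carry the newest information.

12. (Archival literature) is not a public publication.

13. To tell journals from books using the information in a reference list, the main criterion is whether a (volume and issue number) feature appears; if it does, the item is a journal.

14. To tell books from conference literature using the information in a reference list, the main criterion is whether a (publisher) feature appears; if it does, the item is conference literature.

15. According to Bradford's law of scattering, reading articles in (core journals) is an effective way of obtaining information.

16. The impact factor of core journals is (discipline-specific, scholarly, and time-dependent).

17. Among the carrier types used to transmit literature information, (print) is the form with the longest history.

18. Publication types of primary literature include (journals), among others.

19. Secondary literature is also called (retrieval tools); it reports the (main information and source) of documents.

20. When we need a general understanding of unfamiliar knowledge, we can first consult (books).

21. In terms of the carrier's physical form, (electronic) literature is the direction in which literature is developing.

22. (Journals) provide information that is relatively novel, timely, reliable, and specialized.

23. To learn about the domestic core journals in one's field, consult (《中文核心期刊要目总览》, A Guide to the Core Journals of China).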

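Document Publication Types

Introduction to document publication types

By publication type, documents can be divided into: books, newspapers, scientific and technical reports, patent literature, standards, dissertations, conference literature, archives, government publications, and journals.

1 Books (books, monographs): publications that discuss or introduce knowledge of a discipline or field.

1.1 Content characteristics: relatively mature, systematic and comprehensive, basic knowledge
1.2 Standard citation format: Author. Title. Edition (omitted for the first edition). Place of publication: Publisher, year, pages
1.3 Identifying feature: place of publication: publisher
Examples:
(1) Etten V W. Fundamentals of optical fiber communication. London: Prentice-Hall, 1991
(2) 蒋永新主编. 自然科学技术信息检索教程. 2版. 上海：上海大学出版社，2006.
1.4 In practice:
Edition, e.g. second edition, 3rd edition
Editor: ed.
ISBN
Publisher: press, publisher, House, publishing company, etc.
Example: Gash S ed. Effective Literature Searching for Research. 2nd. Hampshire: Gower House, 1999
Contribution within a larger work: G.R. Mettam, L.B. Adams, New Discovery, in: B.S. Jones, R.Z. Smith (Eds.), Introduction to the Electronic Age, E-Publishing, Inc. New York, 1994, pp. 281-304.

2 Journals (journal, transactions): serial publications with a fixed title, a uniform format, and a regular publication pattern, issued periodically or irregularly.

2.1 Content characteristics: novel
2.2 Standard citation format: Author. Article title. Journal title, year, volume, issue, page range
2.3 Identifying feature: journal title, volume, issue
Example: Tohyama H. A plasma Image bar for an electrophotographic printer. Journal of the Imaging Science, 1991, vol.35 no.5, 330-3
(J. Imag. Sci., 1991, 35(5):330-3)
(Journal of the Imaging Science, 1991, v.35 n.5, 330-3)
Words that may appear in journal titles: Journal (J.), Transaction (Trans.), Letter (Lett.), Annual (Ann.)

3 Conference literature (conference): papers and reports presented at important international and domestic academic or professional conferences.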
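The identifying features listed above (a "place: publisher" pair for books; a journal-title word or a volume/issue marker for journals) lend themselves to a simple rule-based check. The Python sketch below is one possible heuristic, not a method given in the source: the regular expressions, the names `JOURNAL_WORDS`, `VOLUME_ISSUE`, `PLACE_PUBLISHER`, and the function `classify_reference` are assumptions for illustration. The abbreviations "J." and "Ann." are deliberately left out because they collide with author initials.

```python
import re

# Title words that suggest a journal (from the list of words in section 2).
JOURNAL_WORDS = re.compile(r"\b(?:Journal|Transactions?|Letters?|Annual)\b|\b(?:Trans|Lett)\.")

# Volume/issue markers such as "35(5)", "vol.35" or "v.35".
VOLUME_ISSUE = re.compile(r"\d+\s*\(\d+\)|\bv(?:ol)?\.\s*\d+")

# "Place: Publisher, year", the identifying feature of a book citation.
PLACE_PUBLISHER = re.compile(r"[A-Z][\w .-]*:\s*[^,:]+,\s*(?:19|20)\d{2}")


def classify_reference(ref: str) -> str:
    """Guess whether a Latin-script reference string is a journal article or a book."""
    if VOLUME_ISSUE.search(ref) or JOURNAL_WORDS.search(ref):
        return "journal"
    if PLACE_PUBLISHER.search(ref):
        return "book"
    return "unknown"


if __name__ == "__main__":
    examples = [
        "Etten V W. Fundamentals of optical fiber communication. London: Prentice-Hall, 1991",
        "Tohyama H. A plasma Image bar for an electrophotographic printer. "
        "Journal of the Imaging Science, 1991, vol.35 no.5, 330-3",
        "J. Imag. Sci., 1991, 35(5):330-3",
    ]
    for ref in examples:
        print(classify_reference(ref), "<-", ref)
```

The journal test runs first because a journal citation can also contain a colon before the page range. The patterns only handle Latin-script citations with ASCII punctuation, so the Chinese example above would come back as "unknown".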

International Computer Science Conference and Journal Ranking List

Computer Science Department Conference Rankings

Some conferences accept multiple categories of papers. The rankings below are for the most prestigious category of paper at a given conference. All other categories should be treated as "unranked".

AREA: Artificial Intelligence and Related Subjects

Rank 1:
IJCAI: Intl Joint Conf on AI
AAAI: American Association for AI National Conference
ICAA: International Conference on Autonomous Agents (now renamed AAMAS)
CVPR: IEEE Conf on Comp Vision and Pattern Recognition
ICCV: Intl Conf on Computer Vision
ICML: Intl Conf on Machine Learning
KDD: Knowledge Discovery and Data Mining
KR: Intl Conf on Principles of KR & Reasoning
NIPS: Neural Information Processing Systems
UAI: Conference on Uncertainty in AI
ACL: Annual Meeting of the ACL (Association of Computational Linguistics)

Rank 2:
AID: Intl Conf on AI in Design
AI-ED: World Conference on AI in Education
CAIP: Intl Conf on Comp. Analysis of Images and Patterns
CSSAC: Cognitive Science Society Annual Conference
ECCV: European Conference on Computer Vision
EAI: European Conf on AI
EML: European Conf on Machine Learning
GP: Genetic Programming Conference
IAAI: Innovative Applications in AI
ICIP: Intl Conf on Image Processing
ICNN/IJCNN: Intl (Joint) Conference on Neural Networks
ICPR: Intl Conf on Pattern Recognition
ICDAR: International Conference on Document Analysis and Recognition
ICTAI: IEEE conference on Tools with AI
AMAI: Artificial Intelligence and Maths
DAS: International Workshop on Document Analysis Systems
WACV: IEEE Workshop on Apps of Computer Vision
COLING: International Conference on Computational Linguistics
EMNLP: Empirical Methods in Natural Language Processing

Rank 3:
PRICAI: Pacific Rim Intl Conf on AI
AAI: Australian National Conf on AI
ACCV: Asian Conference on Computer Vision
AI*IA: Congress of the Italian Assoc for AI
ANNIE: Artificial Neural Networks in Engineering
ANZIIS: Australian/NZ Conf on Intelligent Inf. Systems
CAIA: Conf on AI for Applications
CAAI: Canadian Artificial Intelligence Conference
ASADM: Chicago ASA Data Mining Conf: A Hard Look at DM
EPIA: Portuguese Conference on Artificial Intelligence
FCKAML: French Conf on Know. Acquisition & Machine Learning
ICANN: International Conf on Artificial Neural Networks
ICCB: International Conference on Case-Based Reasoning
ICGA: International Conference on Genetic Algorithms
ICONIP: Intl Conf on Neural Information Processing
IEA/AIE: Intl Conf on Ind. & Eng. Apps of AI & Expert Sys
ICMS: International Conference on Multiagent Systems
ICPS: International conference on Planning Systems
IWANN: Intl Work-Conf on Art & Natural Neural Networks
PACES: Pacific Asian Conference on Expert Systems
SCAI: Scandinavian Conference on Artificial Intelligence
SPICIS: Singapore Intl Conf on Intelligent System
PAKDD: Pacific-Asia Conf on Know. Discovery & Data Mining
SMC: IEEE Intl Conf on Systems, Man and Cybernetics
PAKDDM: Practical App of Knowledge Discovery & Data Mining
WCNN: The World Congress on Neural Networks
WCES: World Congress on Expert Systems
INBS: IEEE Intl Symp on Intell. in Neural & Bio Systems
ASC: Intl Conf on AI and Soft Computing
PACLIC: Pacific Asia Conference on Language, Information and Computation
ICCC: International Conference on Chinese Computing

Others:
ICRA: IEEE Intl Conf on Robotics and Automation
NNSP: Neural Networks for Signal Processing
ICASSP: IEEE Intl Conf on Acoustics, Speech and SP
GCCCE: Global Chinese Conference on Computers in Education
ICAI: Intl Conf on Artificial Intelligence
AEN: IASTED Intl Conf on AI, Exp Sys & Neural Networks
WMSCI: World Multiconfs on Sys, Cybernetics & Informatics

AREA: Hardware and Architecture

Rank 1:
ASPLOS: Architectural Support for Prog Lang and OS
ISCA: ACM/IEEE Symp on Computer Architecture
ICCAD: Intl Conf on Computer-Aided Design
DAC: Design Automation Conf
MICRO: Intl Symp on Microarchitecture
HPCA: IEEE Symp on High-Perf Comp Architecture

Rank 2:
FCCM: IEEE Symposium on Field Programmable Custom Computing Machines
SUPER: ACM/IEEE Supercomputing Conference
ICS: Intl Conf on Supercomputing
ISSCC: IEEE Intl Solid-State Circuits Conf
HCS: Hot Chips Symp
VLSI: IEEE Symp VLSI Circuits
ISSS: International Symposium on System Synthesis
DATE: IEEE/ACM Design, Automation & Test in Europe Conference

Rank 3:
ICA3PP: Algs and Archs for Parall Proc
EuroMICRO: New Frontiers of Information Technology
ACS: Australian Supercomputing Conf

Unranked:
Advanced Research in VLSI
International Symposium on System Synthesis
International Symposium on Computer Design
International Symposium on Circuits and Systems
Asia Pacific Design Automation Conference
International Symposium on Physical Design
International Conference on VLSI Design

AREA: Applications

Rank 1:
I3DG: ACM-SIGGRAPH Interactive 3D Graphics
SIGGRAPH: ACM SIGGRAPH Conference
ACM-MM: ACM Multimedia Conference
DCC: Data Compression Conf
SIGMETRICS: ACM Conf on Meas. & Modelling of Comp Sys
SIGIR: ACM SIGIR Conf on Information Retrieval
PECCS: IFIP Intl Conf on Perf Eval of Comp & Comm Sys
WWW: World-Wide Web Conference

Rank 2:
EUROGRAPH: European Graphics Conference
CGI: Computer Graphics International
CANIM: Computer Animation
PG: Pacific Graphics
IEEE-MM: IEEE Intl Conf on Multimedia Computing and Sys
NOSSDAV: Network and OS Support for Digital A/V
PADS: ACM/IEEE/SCS Workshop on Parallel & Dist Simulation
WSC: Winter Simulation Conference
ASS: IEEE Annual Simulation Symposium
MASCOTS: Symp Model Analysis & Sim of Comp & Telecom Sys
PT: Perf Tools - Intl Conf on Model Tech & Tools for CPE
NetStore - Network Storage Symposium

Rank 3:
ACM-HPC: ACM Hypertext Conf
MMM: Multimedia Modelling
DSS: Distributed Simulation Symposium
SCSC: Summer Computer Simulation Conference
WCSS: World Congress on Systems Simulation
ESS: European Simulation Symposium
ESM: European Simulation Multiconference
HPCN: High-Performance Computing and Networking
Geometry Modeling and Processing
WISE
DS-RT: Distributed Simulation and Real-time Applications
IEEE Intl Wshop on Dist Int Simul and Real-Time Applications

Un-ranked:
DVAT: IS&T/SPIE Conf on Dig Video Compression Alg & Tech
MME: IEEE Intl Conf. on Multimedia in Education
ICMSO: Intl Conf on Modelling, Simulation and Optimisation
ICMS: IASTED Intl Conf on Modelling and Simulation

AREA: System Technology

Rank 1:
SIGCOMM: ACM Conf on Comm Architectures, Protocols & Apps
INFOCOM: Annual Joint Conf IEEE Comp & Comm Soc
SPAA: Symp on Parallel Algms and Architecture
PODC: ACM Symp on Principles of Distributed Computing
PPoPP: Principles and Practice of Parallel Programming
MassPar: Symp on Frontiers of Massively Parallel Proc
RTSS: Real Time Systems Symp
SOSP: ACM SIGOPS Symp on OS Principles
SOSDI: Usenix Symp on OS Design and Implementation
CCS: ACM Conf on Comp and Communications Security
IEEE Symposium on Security and Privacy
MOBICOM: ACM Intl Conf on Mobile Computing and Networking
USENIX Conf on Internet Tech and Sys
ICNP: Intl Conf on Network Protocols
OPENARCH: IEEE Conf on Open Arch and Network Prog
PACT: Intl Conf on Parallel Arch and Compil Tech

Rank 2:
CC: Compiler Construction
IPDPS: Intl Parallel and Dist Processing Symp
IC3N: Intl Conf on Comp Comm and Networks
ICPP: Intl Conf on Parallel Processing
ICDCS: IEEE Intl Conf on Distributed Comp Systems
SRDS: Symp on Reliable Distributed Systems
MPPOI: Massively Par Proc Using Opt Interconns
ASAP: Intl Conf on Apps for Specific Array Processors
Euro-Par: European Conf. on Parallel Computing
Fast Software Encryption
Usenix Security Symposium
European Symposium on Research in Computer Security
WCW: Web Caching Workshop
LCN: IEEE Annual Conference on Local Computer Networks
IPCCC: IEEE Intl Phoenix Conf on Comp & Communications
CCC: Cluster Computing Conference
ICC: Intl Conf on Comm

Rank 3:
MPCS: Intl. Conf. on Massively Parallel Computing Systems
GLOBECOM: Global Comm
ICCC: Intl Conf on Comp Communication
NOMS: IEEE Network Operations and Management Symp
CONPAR: Intl Conf on Vector and Parallel Processing
VAPP: Vector and Parallel Processing
ICPADS: Intl Conf. on Parallel and Distributed Systems
Public Key Cryptosystems
IEEE Computer Security Foundations Workshop
Annual Workshop on Selected Areas in Cryptography
Australasia Conference on Information Security and Privacy
Int. Conf on Inform and Comm. Security
Financial Cryptography
Workshop on Information Hiding
Smart Card Research and Advanced Application Conference
ICON: Intl Conf on Networks
IMSA: Intl Conf on Internet and MMedia Sys
NCC: Nat Conf Comm
IN: IEEE Intell Network Workshop
ICME: Intl Conf on MMedia & Expo
Softcomm: Conf on Software in Tcomms and Comp Networks
INET: Internet Society Conf
Workshop on Security and Privacy in E-commerce

Un-ranked:
PARCO: Parallel Computing
SE: Intl Conf on Systems Engineering

AREA: Programming Languages and Software Engineering

Rank 1:
POPL: ACM-SIGACT Symp on Principles of Prog Langs
PLDI: ACM-SIGPLAN Symp on Prog Lang Design & Impl
OOPSLA: OO Prog Systems, Langs and Applications
ICFP: Intl Conf on Function Programming
JICSLP/ICLP/ILPS: (Joint) Intl Conf/Symp on Logic Prog
ICSE: Intl Conf on Software Engineering
FSE: ACM Conference on the Foundations of Software Engineering (inc: ESEC-FSE when held jointly)
FM/FME: Formal Methods, World Congress/Europe
CAV: Computer Aided Verification

Rank 2:
CP: Intl Conf on Principles & Practice of Constraint Prog
TACAS: Tools and Algos for the Const and An of Systems
ESOP: European Conf on Programming
ICCL: IEEE Intl Conf on Computer Languages
PEPM: Symp on Partial Evaluation and Prog Manipulation
SAS: Static Analysis Symposium
RTA: Rewriting Techniques and Applications
ESEC: European Software Engineering Conf
IWSSD: Intl Workshop on S/W Spec & Design
CAiSE: Intl Conf on Advanced Info System Engineering
ITC: IEEE Intl Test Conf
IWCASE: Intl Workshop on Computer-Aided Software Eng
SSR: ACM SIGSOFT Working Conf on Software Reusability
SEKE: Intl Conf on S/E and Knowledge Engineering
ICSR: IEEE Intl Conf on Software Reuse
ASE: Automated Software Engineering Conference
PADL: Practical Aspects of Declarative Languages
ISRE: Requirements Engineering
ICECCS: IEEE Intl Conf on Eng. of Complex Computer Systems
IEEE Intl Conf on Formal Engineering Methods
Intl Conf on Integrated Formal Methods
FOSSACS: Foundations of Software Science and Comp Struct

Rank 3:
FASE: Fund Appr to Soft Eng
APSEC: Asia-Pacific S/E Conf
PAP/PACT: Practical Aspects of PROLOG/Constraint Tech
ALP: Intl Conf on Algebraic and Logic Programming
PLILP: Prog, Lang Implementation & Logic Programming
LOPSTR: Intl Workshop on Logic Prog Synthesis & Transf
ICCC: Intl Conf on Compiler Construction
COMPSAC: Intl. Computer S/W and Applications Conf
CSM: Conf on Software Maintenance
TAPSOFT: Intl Joint Conf on Theory & Pract of S/W Dev
WCRE: SIGSOFT Working Conf on Reverse Engineering
AQSDT: Symp on Assessment of Quality S/W Dev Tools
IFIP Intl Conf on Open Distributed Processing
Intl Conf of Z Users
IFIP Joint Int'l Conference on Formal Description Techniques and Protocol Specification, Testing, And Verification
PSI (Ershov conference)
UML: International Conference on the Unified Modeling Language

Un-ranked:
Australian Software Engineering Conference
IEEE Int. W'shop on Object-oriented Real-time Dependable Sys. (WORDS)
IEEE International Symposium on High Assurance Systems Engineering
The Northern Formal Methods Workshops
Formal Methods Pacific
Int. Workshop on Formal Methods for Industrial Critical Systems
JFPLC - International French Speaking Conference on Logic and Constraint Programming
L&L - Workshop on Logic and Learning
SFP - Scottish Functional Programming Workshop
HASKELL - Haskell Workshop
LCCS - International Workshop on Logic and Complexity in Computer Science
VLFM - Visual Languages and Formal Methods
NASA LaRC Formal Methods Workshop
(1) FATES - A Satellite workshop on Formal Approaches to Testing of Software
(1) Workshop On Java For High-Performance Computing
(1) DSLSE - Domain-Specific Languages for Software Engineering
(1) FTJP - Workshop on Formal Techniques for Java Programs
(*) WFLP - International Workshop on Functional and (Constraint) Logic Programming
(*) FOOL - International Workshop on Foundations of Object-Oriented Languages
(*) SREIS - Symposium on Requirements Engineering for Information Security
(*) HLPP - International workshop on High-level parallel programming and applications
(*) INAP - International Conference on Applications of Prolog
(*) MPOOL - Workshop on Multiparadigm Programming with OO Languages
(*) PADO - Symposium on Programs as Data Objects
(*) TOOLS: Int'l Conf Technology of Object-Oriented Languages and Systems
(*) Australasian Conference on Parallel And Real-Time Systems

AREA: Algorithms and Theory

Rank 1:
STOC: ACM Symp on Theory of Computing
FOCS: IEEE Symp on Foundations of Computer Science
COLT: Computational Learning Theory
LICS: IEEE Symp on Logic in Computer Science
SCG: ACM Symp on Computational Geometry
SODA: ACM/SIAM Symp on Discrete Algorithms
SPAA: ACM Symp on Parallel Algorithms and Architectures
PODC: ACM Symp on Principles of Distributed Computing
ISSAC: Intl. Symp on Symbolic and Algebraic Computation
CRYPTO: Advances in Cryptology
EUROCRYPT: European Conf on Cryptography

Rank 2:
CONCUR: International Conference on Concurrency Theory
ICALP: Intl Colloquium on Automata, Languages and Prog
STACS: Symp on Theoretical Aspects of Computer Science
CC: IEEE Symp on Computational Complexity
WADS: Workshop on Algorithms and Data Structures
MFCS: Mathematical Foundations of Computer Science
SWAT: Scandinavian Workshop on Algorithm Theory
ESA: European Symp on Algorithms
IPCO: MPS Conf on integer programming & comb optimization
LFCS: Logical Foundations of Computer Science
ALT: Algorithmic Learning Theory
EUROCOLT: European Conf on Learning Theory
WDAG: Workshop on Distributed Algorithms
ISTCS: Israel Symp on Theory of Computing and Systems
ISAAC: Intl Symp on Algorithms and Computation
FST&TCS: Foundations of S/W Tech & Theoretical CS
LATIN: Intl Symp on Latin American Theoretical Informatics
RECOMB: Annual Intl Conf on Comp Molecular Biology
CADE: Conf on Automated Deduction
IEEEIT: IEEE Symposium on Information Theory
Asiacrypt

Rank 3:
MEGA: Methods Effectives en Geometrie Algebrique
ASIAN: Asian Computing Science Conf
CCCG: Canadian Conf on Computational Geometry
FCT: Fundamentals of Computation Theory
WG: Workshop on Graph Theory
CIAC: Italian Conf on Algorithms and Complexity
ICCI: Advances in Computing and Information
AWTI: Argentine Workshop on Theoretical Informatics
CATS: The Australian Theory Symp
COCOON: Annual Intl Computing and Combinatorics Conf
UMC: Unconventional Models of Computation
MCU: Universal Machines and Computations
GD: Graph Drawing
SIROCCO: Structural Info & Communication Complexity
ALEX: Algorithms and Experiments
ALG: ENGG Workshop on Algorithm Engineering
LPMA: Intl Workshop on Logic Programming and Multi-Agents
EWLR: European Workshop on Learning Robots
CITB: Complexity & info-theoretic approaches to biology
FTP: Intl Workshop on First-Order Theorem Proving (FTP)
CSL: Annual Conf on Computer Science Logic (CSL)
AAAAECC: Conf On Applied Algebra, Algebraic Algms & ECC
DMTCS: Intl Conf on Disc Math and TCS

Un-ranked:
Information Theory Workshop

AREA: Data Bases

Rank 1:
SIGMOD: ACM SIGMOD Conf on Management of Data
PODS: ACM SIGMOD Conf on Principles of DB Systems
VLDB: Very Large Data Bases
ICDE: Intl Conf on Data Engineering
ICDT: Intl Conf on Database Theory

Rank 2:
SSD: Intl Symp on Large Spatial Databases
DEXA: Database and Expert System Applications
FODO: Intl Conf on Foundation on Data Organization
EDBT: Extending DB Technology
DOOD: Deductive and Object-Oriented Databases
DASFAA: Database Systems for Advanced Applications
CIKM: Intl. Conf on Information and Knowledge Management
SSDBM: Intl Conf on Scientific and Statistical DB Mgmt
CoopIS - Conference on Cooperative Information Systems
ER - Intl Conf on Conceptual Modeling (ER)

Rank 3:
COMAD: Intl Conf on Management of Data
BNCOD: British National Conference on Databases
ADC: Australasian Database Conference
ADBIS: Symposium on Advances in DB and Information Systems
DaWaK - Data Warehousing and Knowledge Discovery
RIDE Workshop
IFIP-DS: IFIP-DS Conference
IFIP-DBSEC - IFIP Workshop on Database Security
NGDB: Intl Symp on Next Generation DB Systems and Apps
ADTI: Intl Symp on Advanced DB Technologies and Integration
FEWFDB: Far East Workshop on Future DB Systems
MDM - Int. Conf. on Mobile Data Access/Management (MDA/MDM)
ICDM - IEEE International Conference on Data Mining
VDB - Visual Database Systems
IDEAS - International Database Engineering and Application Symposium

Others:
ARTDB - Active and Real-Time Database Systems
CODAS: Intl Symp on Cooperative DB Systems for Adv Apps
DBPL - Workshop on Database Programming Languages
EFIS/EFDBS - Engineering Federated Information (Database) Systems
KRDB - Knowledge Representation Meets Databases
NDB - National Database Conference (China)
NLDB - Applications of Natural Language to Data Bases
KDDMBD - Knowledge Discovery and Data Mining in Biological Databases Meeting
FQAS - Flexible Query-Answering Systems
IDC(W) - International Database Conference (HK CS)
RTDB - Workshop on Real-Time Databases
SBBD: Brazilian Symposium on Databases
WebDB - International Workshop on the Web and Databases
WAIM: International Conference on Web Age Information Management
(1) DASWIS - Data Semantics in Web Information Systems
(1) DMDW - Design and Management of Data Warehouses
(1) DOLAP - International Workshop on Data Warehousing and OLAP
(1) DMKD - Workshop on Research Issues in Data Mining and Knowledge Discovery
(1) KDEX - Knowledge and Data Engineering Exchange Workshop
(1) NRDM - Workshop on Network-Related Data Management
(1) MobiDE - Workshop on Data Engineering for Wireless and Mobile Access
(1) MDDS - Mobility in Databases and Distributed Systems
(1) MEWS - Mining for Enhanced Web Search
(1) TAKMA - Theory and Applications of Knowledge MAnagement
(1) WIDM: International Workshop on Web Information and Data Management
(1) W2GIS - International Workshop on Web and Wireless Geographical Information Systems
* CDB - Constraint Databases and Applications
* DTVE - Workshop on Database Technology for Virtual Enterprises
* IWDOM - International Workshop on Distributed Object Management
* IW-MMDBMS - Int. Workshop on Multi-Media Data Base Management Systems
* OODBS - Workshop on Object-Oriented Database Systems
* PDIS: Parallel and Distributed Information Systems

AREA: Miscellaneous

Rank 1:

Rank 2:
AMIA: American Medical Informatics Annual Fall Symposium
DNA: Meeting on DNA Based Computers

Rank 3:
MEDINFO: World Congress on Medical Informatics
International Conference on Sequences and their Applications
ECAIM: European Conf on AI in Medicine
APAMI: Asia Pacific Assoc for Medical Informatics Conf
SAC: ACM/SIGAPP Symposium on Applied Computing
ICSC: Internal Computer Science Conference
ISCIS: Intl Symp on Computer and Information Sciences
ICSC2: International Computer Symposium Conference
ICCE: Intl Conf on Comps in Edu
Ed-Media
WCC: World Computing Congress
PATAT: Practice and Theory of Automated Timetabling

Not Encouraged (due to dubious referee process):
International Multiconferences in Computer Science -- 14 joint int'l confs.
SCI: World Multi confs on systemics, sybernetics and informatics
SSGRR: International conf on Advances in Infrastructure for e-B, e-Edu and e-Science and e-Medicine
IASTED conferences

The journals follow. The IEEE/ACM TRANSACTIONS series is generally recognized as the top journals in their fields, so the classification of IEEE/ACM TRANSACTIONS titles in the list below may not be entirely accurate.


Using query logs to establish vocabularies in distributed information retrieval

Milad Shokouhi *, Justin Zobel, Saied Tahaghoghi, Falk Scholer
School of Computer Science and Information Technology, RMIT University, Melbourne 3001, Australia

Received 11 December 2005; accepted 3 April 2006. Available online 7 July 2006.

Abstract

Users of search engines express their needs as queries, typically consisting of a small number of terms. The resulting search engine query logs are valuable resources that can be used to predict how people interact with the search system. In this paper, we introduce two novel applications of query logs, in the context of distributed information retrieval. First, we use query log terms to guide sampling from uncooperative distributed collections. We show that while our sampling strategy is at least as efficient as current methods, it consistently performs better. Second, we propose and evaluate a pruning strategy that uses query log information to eliminate terms. Our experiments show that our proposed pruning method maintains the accuracy achieved by complete indexes, while decreasing the index size by up to 60%. While such pruning may not always be desirable in practice, it provides a useful benchmark against which other pruning strategies can be measured.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Distributed information retrieval; Uncooperative environments; Indexing; Query logs

1. Introduction

Traditional information retrieval systems use corpus, document, and query statistics to identify likely answers to users' queries. However, these queries can be captured in a query log, providing an additional source of evidence of relevance. In recent years, considerable attention has been devoted to the study of query logs and the way people express their information needs (de Moura et al., 2005; Fagni, Perego, Silvestri, & Orlando, 2006; Jansen & Spink, 2005). The query logs of commercial search engines such as Excite (Spink, Wolfram, Jansen, & Saracevic, 2001), Altavista
(Silverstein, Marais, Henzinger, & Moricz, 1999), and AlltheWeb (Jansen & Spink, 2005) have been investigated and analysed. Query logs have been used in information retrieval research for applications such as query expansion (Billerbeck, Scholer, Williams, & Zobel, 2003; Cui, Wen, Nie, & Ma, 2002), contextual text retrieval (Wen, Lao, & Ma, 2004), and image retrieval (Hoi & Lyu, 2004). The question we explore in this paper is how query logs can be used to guide future search, in the context of distributed information retrieval.

0306-4573/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2006.04.003
* Corresponding author. E-mail address: milad@.au (M. Shokouhi).
Information Processing and Management 43 (2007) 169–180

In distributed information retrieval (DIR) systems, the task is to search a group of separate collections and identify the most likely answers from a subset of these. Brokers receive queries from the users and send them to those collections that are deemed most likely to contain relevant answers. In a cooperative environment, collections inform brokers about the information they contain by providing information such as term distribution statistics. In uncooperative environments, on the other hand, collections do not provide any information about their content to brokers. A technique that can be used to obtain information about collections in such environments is to send probe queries to each collection. Information gathered from the limited number of answer documents that a collection provides in response to such queries is used to construct a representation set; this representation set guides the evaluation of user queries.

In this paper, we introduce two novel applications of query logs: sampling for improved query probing, and pruning of index information. The first of these is sampling. Using a TREC web crawl, we show that query log terms can be used to produce effective samples from uncooperative collections. We compare the
performance of our strategy with the state-of-the-art method, and show that samples obtained using query log terms allow for more effective collection selection and retrieval performance: improvements in average precision are often over 50%. Our method is at least as efficient as current sampling methods, and can be much more efficient for some collections.

Our second new use of query logs is a pruning strategy that uses query log terms to remove less significant terms from collection representation sets. For a DIR environment with a large number of collections, the total size of collection representation sets on the broker might become impractically large. The goal of pruning methods is to eliminate unimportant terms from the index without harming retrieval performance. In previous work, such as that of Carmel et al. (2001), Craswell, Hawking, and Thistlewaite (1999), de Moura et al. (2005) and Lu and Callan (2002), pruning strategies have had an adverse effect on performance. The reason is that these approaches drop many terms that are necessary to future queries. We show that pruning based on query logs does not decrease search precision. In addition, our method can be applied during document indexing, which means that it is independent of term frequency statistics. We also test our method on central indexes and for different types of search tasks. We show that, by applying our pruning strategy, the same performance as a full index can be achieved, while substantially reducing index size. In practice, such pruning might not always be desirable; if a term is present, it should be searchable. However, our pruning does provide an interesting benchmark against which other methods can be measured, and is clearly superior to the principal alternative.

2. Distributed search

The vast volume of data on the web makes it extremely costly for a single search engine to provide comprehensive coverage. Moreover, public search engines cannot crawl and index documents to which there are no public links, or from which crawlers are
forbidden. These documents form the so-called hidden web and can generally only be viewed by using custom search interfaces supplied as part of the site. Distributed information retrieval (DIR) aims to address this issue by passing queries to multiple servers through a central broker. Each server sends its top-ranked answers back to the broker, which produces a single ranked list of answers for presentation to the user. For efficiency, the broker usually passes the query only to a subset of available servers, selecting those that are most likely to contain relevant answers. To identify the appropriate servers, the broker calculates a similarity between the query and the representation set of each server.

In cooperative environments, servers provide the broker with their representation sets (Callan, Lu, & Croft, 1995; Fuhr, 1999; Gravano, Chang, Garcia-Molina, & Paepcke, 1997, 1999; Yuwono & Lee, 1997). The broker can be aware of the distribution of terms at the servers, and is therefore able to calculate weights for each server. Queries are sent to those servers that indicate the highest weight for the query terms.

In practice, servers may be uncooperative and therefore do not publish their index information. Server representation sets can be gathered using query-based sampling (QBS) (Callan, Connell, & Du, 1999). In QBS, an initial query is created from frequently-occurring terms found in a reference collection, to increase the chance of receiving an answer, and sent to the server. The query results provided by the server are downloaded, and another query is created using randomly-selected terms from these results. This process continues until a sufficient number of documents have been downloaded (Callan & Connell, 2001; Callan et al., 1999; Shokouhi, Scholer, & Zobel, 2006). Many queries do not return any yet-unseen answers; Ipeirotis and Gravano (2002) claim that, on average, one new document is received per two queries. QBS also downloads many pages that
are not highly representative of the server.

2.1. Query-based sampling

Query-based sampling (QBS) was introduced by Callan et al. (1999), who suggested that even a small number of documents (such as 300) obtained by random sampling can effectively represent the collection held at a server. They tested their method on the CACM collection (Jones & Rijsbergen, 1976) and many other small servers artificially created from TREC newswire data (Voorhees & Harman, 2000). In QBS, subsequent queries after the first are selected by choosing terms from documents that have been downloaded so far (Callan et al., 1999). Various methods were explored; random selection of query terms was found to be the most effective way of choosing probe queries, and this method has since been used in other work on sampling non-cooperative servers (Craswell, Bailey, & Hawking, 2000; Si & Callan, 2003). These methods generally proceed until a fixed number of documents (usually 300) have been downloaded. However, Shokouhi et al. (2006) have shown that for more realistic, larger collections, fixed-size samples might not be suitable, as the coverage of the vocabulary of the server is poor.

An alternative technique, called Qprober (Gravano, Ipeirotis, & Sahami, 2003), has been proposed for automatic classification of servers. Here, a classification system is trained with a set of pages and judgments. Then the system suggests the classification rules and uses the rules as queries. For example, if the classification system suggests (Madonna → Music), it uses "Madonna" as a query and classifies the downloaded pages as music-related. Qprober differs from QBS in the way that probe queries are selected and requires a classification system in the background.

2.2. Using query logs for sampling

Terms that appear in search engine query logs are, by definition, popular in queries, and tend to refer to topics that are well-represented in the collection. We therefore hypothesise that probe queries composed of query log terms would return more answers than random terms, leading to higher
efficiency. Since query terms are aligned with actual user interests, we also believe that sampling using query log terms would better reflect user needs than random terms from downloaded documents. Hence, instead of choosing the terms from downloaded documents for probe queries, we use terms from query logs. Analysis of our method shows that it is at least as efficient as previous methods, and generates samples that produce higher overall effectiveness.

2.3. Evaluation

To simulate a DIR environment, we extracted documents from the 100 largest servers in the TREC WT10g collection (Bailey, Craswell, & Hawking, 2003). These sets vary in size from 26505 documents (www9.yahoo.com), to 2790 documents (), with an average size of 5602 documents per server. For sampling queries, we used the 1000 most frequent terms in the Excite search engine query logs collected in 1997 (Spink et al., 2001).

For each query, we download the top 10 answers; this is the number of results that most search interfaces return on the first page of results. Sampling stops after 300 unique documents have been downloaded or 1000 queries have been sent (whichever comes first). Although using fixed-size samples might not always be the optimal method (Shokouhi et al., 2006), we restrict ourselves to 300 documents to ensure that our results are comparable to the widely accepted baseline (Callan et al., 1999).

For each server we gather two samples: one by query-based sampling, and the other by our query log method. For query log (QL) experiments, each of the 1000 most frequent terms in the Excite query logs is passed as a probe query to the collection, and the top 10 returned answers are collected. For QBS, probe queries are selected from the current downloaded documents at each time, and the top 10 results of each query are gathered.

To evaluate the effectiveness of samples for different queries, we used topics 451–550 from the TREC-9 and TREC 2001 Web Tracks. We used only terms in the title field as queries. Since we are extracting only the largest 100 servers from WT10g, the number
of available relevant documents is low, so the precision-recall metrics produce poor results. For this reason, many DIR experiments use the set of documents that are retrieved by a central server as an oracle. That is, all of the top-ranked pages returned by the central index are considered to be relevant, and the performance of DIR approaches is evaluated based on how effectively they can retrieve this set (Craswell et al., 2000; Xu & Callan, 1998). Therefore, we use a central index containing the documents of all 100 servers (see footnote 4) as a benchmark.

For both the baseline and DIR experiments, we gathered the top 10 results for each query. Results for 100 and 1000 answers per query were found to be similar and are not presented here. We tested different cutoff (CO) points in our evaluations: for a cutoff of 1, the queries were passed to the one server with the most similar corresponding representation set; for a cutoff of 50, queries were sent to the top 50 servers. Table 1 shows that the QL method consistently produces better results. Differences that are statistically significant based on the t-test at the 0.05 and 0.01 level of significance are indicated by † and ‡, respectively. For mean average precision (MAP), which is considered to be the most reliable evaluation metric (Sanderson & Zobel, 2005), QL outperforms QBS significantly in four of five cases.

We made two key observations. First, query log (QL) terms did not retrieve the expected 300 documents for four servers after 1000 queries, while QBS failed to retrieve this number from only one server. Analysis showed that these servers contain documents unlikely to be of general interest to users. For example, one of these servers has error pages and HTML forms, while another includes many pages with non-text characters. Second, the QL method downloads an average of 2.43 unseen documents per query, while the corresponding average for QBS is 2.80.

Having access to the term document frequency information of any collection, it is possible to calculate the expected number of answers from the collection, for
single-term queries extracted randomly from its index. Therefore, we indexed all of the servers together as a global collection. At most 10 answers are retrieved per query. The expected number of answers per query can be calculated as

$$\frac{|\text{Number of Terms}_{df>9}|}{|\text{Total Number of Terms}|} \times 10 + \sum_{i=1}^{9} \frac{|\text{Number of Terms}_{df=i}|}{|\text{Total Number of Terms}|} \times i$$

which gives an expected value of 2.60, close to numbers obtained by both the QL and QBS methods.

Table 1
Comparison of the QL and QBS methods on a subset of the WT10g data; QL consistently performs better

CO    MAP                 P@5                 P@10                R-precision
      QBS      QL         QBS      QL         QBS      QL         QBS      QL
1     0.0668   0.0902     0.1302   0.1721     0.0744   0.0988     0.0744   0.0988
10    0.1562   0.2515‡    0.3057   0.4322‡    0.2011   0.3023‡    0.2011   0.3023‡
20    0.1617   0.2811‡    0.3149   0.4621‡    0.2115   0.3437‡    0.2115   0.3437‡
30    0.1540   0.2655‡    0.2941   0.4471‡    0.2106   0.3259‡    0.2106   0.3259‡
40    0.1812   0.2639‡    0.3200   0.4306‡    0.2459   0.3212‡    0.2459   0.3212‡
50    0.1868   0.4188‡    0.3341   0.4188†    0.2506   0.3176‡    0.2506   0.3176‡

Differences that are statistically significant based on the t-test at the 0.05 and 0.01 level of significance are indicated by † and ‡, respectively. "CO" is the cutoff number of servers from which answers are retrieved.

Footnote 4: The 100 servers consist of 563656 documents in total, containing 309195668 terms, 1788711 of them unique.

However, these values contrast with those reported by Ipeirotis and Gravano (2002), who claim that QBS downloads an average of only one unseen document per two queries. On further investigation, we observed that the average varies for different collections, as shown in Table 2. The first collection is extracted from TREC AP newswire data and contains newspaper articles. Collections labelled WEB are subsets of the TREC WT10g collection (Bailey et al., 2003). Finally, GOV-1 is a subset of the TREC GOV1 collection (Craswell & Hawking, 2002). Note that the average values for QL are between 4.9 and 7.4 unseen documents per query, while for QBS these range from 1.2 to 4.5. In general, the gap between methods is more significant
for larger collections with broad topics. Each QL probe query returns about 10 answers, the maximum, on the first page, while this number is considerably lower for QBS.

3. Pruning using query logs

In uncooperative DIR systems, the broker keeps a representation sample for each collection (Callan & Connell, 2001; Craswell et al., 2000; Shokouhi et al., 2006). These samples usually contain a small number of documents downloaded by query-based sampling (Callan et al., 1999) from the corresponding collections.

Pruning is the process of excluding unimportant terms (or unimportant term occurrences) from the index to reduce storage cost and, by making better use of memory, increase querying speed. The importance of a term can be calculated according to factors such as lexicon statistics (Carmel et al., 2001) or position in the document (Craswell et al., 1999). The major drawback with current pruning strategies is that they decrease precision, because the pruned index can miss terms that occur in user queries. In addition, in lexicon-based pruning strategies, the indexing process is slowed significantly. First, documents need to be parsed, so that term distribution statistics are available. Then, unimportant terms can be identified, and excluded, based on the lexicon statistics. For example, terms that occur in a large proportion of documents might be treated as stopwords. Finally, the index needs to be updated based on the new pruned vocabulary statistics.

Lexicon-based pruning strategies face additional problems when dealing with broker indexes in DIR. Documents are gathered from different collections, with different vocabulary statistics. A term that appears unimportant in one collection based on its occurrences might in fact be critical in another collection. Therefore, pruning the broker's index based on the global lexicon statistics does not seem reasonable.

We introduce a new pruning method that addresses these problems. Our method prunes during parsing, and is therefore faster than lexicon-based methods, as index
updates are not required. Unlike other approaches, our proposed method does not harm precision, and can increase retrieval performance in some cases. Note, however, that we regard this pruning strategy as an illustration of the power of query logs rather than a method that should be deployed in practice: users who search for a term should be able to find matches if they are present in the collection. Although sampling inevitably involves some loss, that loss should be minimised. That said, as our experiments show, the new pruning method is both effective and efficient.

3.1. Related work

Pruning is widely used for efficiency, either to increase query processing speed (Persin, Zobel, & Sacks-Davis, 1996), or to save disk storage space (Carmel et al., 2001; Craswell et al., 1999; de Moura et al., 2005; Lu & Callan, 2002).

Table 2
Comparison of the QL and QBS methods, showing average number of answers returned per query

Collection   Size     Unseen (QBS)   Total (QBS)   Unseen (QL)   Total (QL)
Newswire     30507    4.6            5.8           4.9           9.1
WEB-1        304035   1.8            2.2           6.9           9.9
WEB-2        218489   2.5            2.9           6.6           9.9
WEB-3        817025   1.2            1.5           7.4           9.9
GOV-1        136176   2.5            3.8           7.3           9.7

Carmel et al. (2001) proposed a pruning strategy where each indexed term is sent in turn as a query to their search system. Index information is discarded for those documents that contain the query term, but do not appear in the top ranked results in response to the query. This strategy is computationally expensive and time consuming. The soundness of this approach is unclear; the highly ranked pages for many queries are not highly ranked for any of the individual query terms. De Moura et al. (2005) have extended Carmel's method.
They apply Carmel's method to extract the most important terms of each collection. Then they keep only those sentences that contain important terms, and delete the rest. For the same reason discussed previously, this approach also is not applicable in uncooperative DIR environments. Although this approach is more effective than Carmel's method in most cases, the loss in average precision compared to a full-text baseline is significant.

D'Souza, Thorn, and Zobel (2004) discuss surrogate methods for pruning, where only the most significant words are kept for each document. In this approach, the representation set is not the collection's vocabulary, but is instead a complete index for the surrogates. Such an approach requires a high level of cooperation between servers.

Craswell et al. (1999) use a pruning strategy to merge the results gathered from multiple search engines. In their work, they download the first four kilobytes of each returned document instead of extracting the whole document for result merging. They showed that in some cases, the loss in performance is negligible. They only evaluated their method for result merging.

In a comprehensive analysis of pruning in brokers, Lu and Callan (2002) divided pruning methods into various groups: frequency-based methods prune documents according to lexicon statistics; location-based methods exclude terms based on the position of their appearance in the documents; single-occurrence methods set a pruning threshold based on the number of unique terms in documents, and keep one instance of each term in the document; and multiple-occurrence methods allow for multiple occurrences of terms in pruned documents. Experiments evaluating the performance of nine methods demonstrated that four models can achieve similar optimal levels of performance, and do not have any significant advantage over each other. Of these best methods, FIRSTM is the only one which does not rely on a broker's vocabulary statistics. For each document, this approach stores information about
the first 1600 terms. The other methods measure the importance of terms based on frequency information. As discussed, these methods are unsuitable for DIR in many ways: the frequency of a term in the broker does not indicate its importance in the original collections; the cost of pruning and re-indexing might be high; and adding a new collection makes the current pruned index unusable, since after a new collection is added to the system, the previous information is no longer valid. The FIRSTM approach prunes during parsing, which makes it more comparable with our approach. Therefore, we evaluate our approach by using FIRSTM as a baseline.

Lu and Callan tested their methods on 100 collections created from TREC disks 1, 2, and 3, and showed that their models can reduce storage costs by 54%–93%, with less than a 10% loss in precision. We test our systems on WT10g and GOV1, which are larger and consist of unmanaged data crawled from the Web; these collections are described in more detail in Section 3.2.

Our proposed pruning method is applied during parsing, and is independent of index updates, as the addition of new collections to the system does not require the re-indexing of the original documents. Moreover, our pruning method does not reduce system performance and precision, while in all of the discussed previous work (Carmel et al., 2001; Craswell et al., 1999; de Moura et al., 2005; Lu & Callan, 2002), pruning results in a decrease in precision.

3.2. Using query logs for pruning

The main motivation for pruning is to omit unimportant terms from the index. That is, pruning methods are intended to exclude the terms that are less likely to appear in user queries (de Moura et al., 2005). Some methods prune terms that are rare in documents (Lu & Callan, 2002). However, the distribution of terms in user queries is not similar to that in typical web documents.

We propose using the history of previous user queries to achieve this directly. Our hypothesis is that pruning those terms that do not appear in a search engine query log will
be able to reduce index sizes while maintaining retrieval performance. We test our hypothesis with experiments on distributed environments and central indexes for different types of queries. In a standard search environment, where completeness may be more important than improvements in efficiency, such pruning (or any pruning) is unappealing; but in a distributed environment, where index information is incomplete and is difficult to gather, such an approach has significant promise.

For our experiments, we used a list of the 315936 unique terms in the log of about one million queries submitted to the Excite search engine on 16 September 1997 (Spink et al., 2001). Larger query logs, or a combination of query logs from different search engines, might be useful for larger collections. Also, for highly topic-specific collections, topical query logs (Beitzel, Jensen, Chowdhury, Grossman, & Frieder, 2004) and query terms that have been classified into different categories (Jansen, Spink, & Pedersen, 2005) could provide additional benefits.

For experiments on uncooperative DIR environments and brokers, we used the testbed described in Section 2.3. The 100 largest servers were extracted from the TREC WT10g collection, with each server being considered as a separate collection. Query-based sampling, as described in Section 2.1, was used to obtain representation sets for each collection by downloading 300 documents from each server in our testbed. We do not omit stopwords in any of our experiments. For each downloaded sample, we only retained information about those terms that were present in our query log, and eliminated the other terms from the broker. We used CORI (Callan et al., 1995) for collection selection and result merging; CORI has been used in many papers as a baseline (Craswell et al., 2000; Nottelmann & Fuhr, 2003; Powell & French, 2003; Si & Callan, 2003). TREC topics 451–550 and their corresponding relevance judgements were used to evaluate the effectiveness of our
pruned representation sets. We use only the ⟨title⟩ field of TREC topics as queries for the search system.

To test our pruning method on central indexes, we used the TREC WT10g and GOV1 collections. The GOV1 collection (Craswell & Hawking, 2002) contains over a million documents crawled from domain. TREC topics 451–550 were used for our experiments on WT10g, and topics 551–600 and NP01–NP150 were used with the GOV1 collection. All experiments with central indexes use the Okapi BM25 similarity measure (Robertson, Walker, Hancock-Beaulieu, Gull, & Lau, 1992).

In addition to these topic-finding search tasks, we evaluate our pruning approach on central indexes for named-page finding and topic distillation tasks. In topic distillation, the objective is to find relevant homepages related to a general topic (Craswell, Hawking, Wilkinson, & Wu, 2003). We use the TREC topic distillation topics 551–600, and corresponding relevance judgements, with the GOV1 collection. For named-page finding (also known as homepage finding) the aim is to find particular web pages of named individuals or organisations. To evaluate this type of search task, we used the TREC named-page finding queries NP01–NP150 (Craswell & Hawking, 2002).

3.3. Distributed retrieval results

The results of our experiments using different pruning methods for DIR systems are shown in Table 3. For each scheme, up to 1000 answers were returned per query. The cutoff (CO) values show the number of collections that are selected for each query. That is, for the first row, only the best collection is selected while for the

Table 3
Effectiveness of different pruning schemes, on a subset of WT10g

CO    MAP                          P@10                         R-precision
      ORIG     FIRSTM   PR         ORIG     FIRSTM   PR         ORIG     FIRSTM   PR
1     0.0178   0.0096   0.0178     0.0367   0.0316   0.0367     0.0217   0.0071   0.0217
10    0.0415   0.0399   0.0415     0.0620   0.0468   0.0630     0.0593   0.0400   0.0594
20    0.0355   0.0468   0.0504     0.0590   0.0494   0.0646     0.0485   0.0460   0.0644
30    0.0399   0.0462   0.0546     0.0603   0.0481   0.0658†    0.0500   0.0426   0.0659*
40    0.0489   0.0509   0.0628†    0.0641   0.0561   0.0671†    0.0521   0.0437   0.0611
50    0.0506   0.0516   0.0647†    0.0654   0.0565   0.0684†    0.0562   0.0439   0.0708*

Significance at the 0.1 and 0.05 levels of confidence is indicated with * and †, respectively. "CO" is the cutoff number of servers from which answers are fetched.
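The two parse-time pruning schemes compared in Table 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the function names (`tokenize`, `query_log_vocabulary`, `prune_with_query_log`, `prune_first_m`) and the toy queries and document are all assumptions made for illustration. The query-log scheme (PR) keeps a sampled document's term only if it occurs in the query log, while the FIRSTM-style baseline keeps the first m terms of each document.

```python
import re
from typing import Iterable, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric runs (a simplifying assumption)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def query_log_vocabulary(queries: Iterable[str]) -> Set[str]:
    """Collect the set of unique terms that occur in a query log."""
    vocab: Set[str] = set()
    for query in queries:
        vocab.update(tokenize(query))
    return vocab


def prune_with_query_log(doc: str, vocab: Set[str]) -> List[str]:
    """Keep only document terms that appear in the query log.

    The decision is made term by term while parsing, so no collection-wide
    frequency statistics and no later index update are needed.
    """
    return [t for t in tokenize(doc) if t in vocab]


def prune_first_m(doc: str, m: int = 1600) -> List[str]:
    """FIRSTM-style baseline: keep the first m terms of the document."""
    return tokenize(doc)[:m]


# Hypothetical toy data, for illustration only.
queries = ["cheap flights melbourne", "melbourne weather", "python tutorial"]
vocab = query_log_vocabulary(queries)
doc = "Melbourne airport reports heavy weather; flights delayed by fog."

kept = prune_with_query_log(doc, vocab)
print(kept)                   # ['melbourne', 'weather', 'flights']
print(prune_first_m(doc, 3))  # ['melbourne', 'airport', 'reports']
```

Because both functions look at one document at a time, adding a new collection to the broker would not invalidate terms already indexed, which is the property the text attributes to pruning during parsing.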