探索搜索引擎爬虫毕业论文外文翻译(可编辑)

探索搜索引擎爬虫毕业论文外文翻译(可编辑)
探索搜索引擎爬虫毕业论文外文翻译(可编辑)

外文译文正文:

探索搜索引擎爬虫随着网络难以想象的急剧扩张,从Web中提取知识逐渐成为一种受欢迎的途径。这是由于网络的便利和丰富的信息。通常需要使用基于网络爬行的搜索引擎来找到我们需要的网页。本文描述了搜索引擎的基本工作任务。概述了搜索引擎与网络爬虫之间的联系。

关键词:爬行,集中爬行,网络爬虫

导言在网络上是一种服务,驻留在链接到互联网的电脑上,并允许最终用户访问是用标准的接口软件的计算机中的存储数据。万维网是获取访问网络信息的宇宙,是人类知识的体现。搜索引擎是一个计算机程序,它能够从网上搜索并扫描特定的关键字,尤其是商业服务,返回的它们发现的资料清单,抓取搜索引擎数据库的信息主要通过接收想要发表自己作品的作家的清单或者通过“网络爬虫”、“蜘蛛”或“机器人”漫游互联网捕捉他们访问过的页面的相关链接和信息。

网络爬虫是一个能够自动获取万维网的信息程序。网页检索是一个重要的研究课题。爬虫是软件组件,它访问网络中的树结构,按照一定的策略,搜索并收集当地库中检索对象。本文的其余部分组织如下:第二节中,我们解释了Web爬虫背景细节。在第3节中,我们讨论爬虫的类型,在第4节中我们将介绍网络爬虫的工作原理。在第5节,我们搭建两个网络爬虫的先进技术。在第6节我们讨论如何挑选更有趣的问题。

调查网络爬虫网络爬虫几乎同网络本身一样古老。第一个网络爬虫,马修格雷浏览者,写于1993年春天,大约正好与首次发布的OCSA Mosaic网络同时发布。在最初的两次万维网会议上发表了许多关于网络爬虫的文章。然而,在当时,网络

i现在要小到三到四个数量级,所以这些系统没有处理好当今网络中一次爬网固有的缩放问题。显然,所有常用的搜索引擎使用的爬网程序必须扩展到网络的实质性部分。但是,由于搜索引擎是一项竞争性质的业务,这些抓取的设计并没有公开描述。有两个明显的例外:股沟履带式和网络档案履带式。不幸的是,说明这些文献中的爬虫程序是太简洁以至于能够进行重复。原谷歌爬虫(在斯坦福大学开发的)组件包括五个功能不同的运行流程。服务器进程读取一个URL出来然后通过履带式转发到多个进程。每个履带进程运行在不同的机器,是单线程的,使用异步I/O采用并行的模式从最多300个网站来抓取数据。爬虫传输下载的页面到一个能进行网页压缩和存储的存储服务器进程。然后这些页面由一个索引进程进行解读,从6>HTML页面中提取链接并将他们保存到不同的磁盘文件中。一个URL 解析器进程读取链接文件,并将相对的网址进行存储,并保存了完整的URL到磁盘文件然后就可以进行读取了。通常情况下,因为三到四个爬虫程序被使用,所有整个系统需要四到八个完整的系统。在谷歌将网络爬虫转变为一个商业成果之后,在斯坦福大学仍然在进行这方面的研究。斯坦福Web Base项目组已实施一个高性能的分布式爬虫,具有每秒可以下载50到100个文件的能力。Cho等人又发展了文件更新频率的模型以报告爬行下载集合的增量。互联网档案馆还利用多台计算机来检索网页。每个爬虫程序被分配到64个站点进行检索,并没有网站被分配到一个以上的爬虫。每个单线程爬虫程序读取到其指定网站网址列表的种子从磁盘到每个站点的队列,然后用异步I/O来从这些队列同时抓取网页。一旦一个页面下载完毕,爬虫提取包含在其中的链接。如果一个链接提到它被包含在页面中的网站,它被添加到适当的站点排队;否则被记录在磁盘。每隔一段时间,合并成一个批处理程序的具体地点的种子设置这些记录“跨网站”的网址,过滤掉进程

中的重复项。Web Fountian爬虫程序分享了魔卡托结构的几个特点:它是分布式的,连续,有礼貌,可配置的。不幸的是,写这篇文章,WebFountain是在其发展的早期阶段,并尚未公布其性能数据。

搜索引擎基本类型

基于爬虫的搜索引擎基于爬虫的搜索引擎自动创建自己的清单。计算机程序“蜘蛛”建立他们没有通过人的选择。他们不是通过学术分类进行组织,而是通过计算机算法把所有的网页排列出来。这种类型的搜索引擎往往是巨大的,常常能取得了大龄的信息,它允许复杂的搜索范围内搜索以前的搜索的结果,使你能够改进搜索结果。这种类型的搜素引擎包含了网页中所有的链接。所以人们可以通过匹配的单词找到他们想要的网页。

B.人力页面目录这是通过人类选择建造的,即他们依赖人类创建列表。他们以主题类别和科目类别做网页的分类。人力驱动的目录,永远不会包含他们网页所有链接的。他们是小于大多数搜索引擎的。

C.混合搜索引擎一种混合搜索引擎以传统的文字为导向,如谷歌搜索引擎,如雅虎目录搜索为基础的搜索引擎,其中每个方案比较操作的元数据集不同,当其元数据的主要资料来自一个网络爬虫或分类分析所有互联网文字和用户的搜索查询。与此相反,混合搜索引擎可能有一个或多个元数据集,例如,包括来自客户端的网络元数据,将所得的情境模型中的客户端上下文元数据俩认识这两个机构。

爬虫的工作原理网络爬虫是搜索引擎必不可少的组成部分:运行一个网络爬虫是一个极具挑战的任务。有技术和可靠性问题,更重要的是有社会问题。爬虫是最脆弱的应用程序,因为它涉及到交互的几百几千个Web服务器和各种域名服

务器,这些都超出了系统的控制。网页检索速度不仅由一个人的自己互联网连接速度有关,同时也受到了要抓取的网站的速度。特别是如果一个是从多个服务器抓取的网站,总爬行时间可以大大减少,如果许多下载是并行完成。虽然有众多的网络爬虫应用程序,他们在核心内容上基本上是相同的。以下是应用程序网络爬虫的工作过程:

下载网页

通过下载的页面解析和检索所有的联系

对于每一个环节检索,重复这个过程。网络爬虫可用于通过对完整的网站的局域网进行抓取。可以指定一个启动程序爬虫跟随在HTML页中找到所有链接。这通常导致更多的链接,这之后将再次跟随,等等。一个网站可以被视为一个树状结构看,根本是启动程序,在这根的HTML页的所有链接是根子链接。随后循环获得更多的链接。一个网页服务器提供若干网址清单给爬虫。网络爬虫开始通过解析一个指定的网页,标注该网页指向其他网站页面的超文本链接。然后他们分析这些网页之间新的联系,等等循环。网络爬虫软件不实际移动到各地不同的互联网上的电脑,而是像电脑病毒一样通过智能代理进行。每个爬虫每次大概打开大约300个链接。这是索引网页必须的足够快的速度。一个爬虫互留在一个机器。爬虫只是简单的将HTTP请求的文件发送到互联网的其他机器,就像一个网上浏览器的链接,当用户点击。所有的爬虫事实上是自动化追寻链接的过程。网页检索可视为一个队列处理的项目。当检索器访问一个网页,它提取到其他网页的链接。因此,爬虫置身于这些网址的一个队列的末尾,并继续爬行到下一个页面,然后它从队列前面删除。

资源约束爬行消耗资源:下载页面的带宽,支持私人数据结构存储的内存,来

评价和选择网址的CPU,以及存储文本和链接以及其他持久性数据的磁盘存储。

B.机器人协议机器人文件给出排除一部分的网站被抓取的指令。类似地,一个简单的文本文件可以提供有关的新鲜和出版对象的流行信息。对信息允许抓取工具优化其收集的数据刷新策略以及更换对象的政策。

C.元搜索引擎一个元搜索引擎是一种没有它自己的网页数据库的搜索引擎。它发出的搜索支持其他搜索引擎所有的数据库,从所有的搜索引擎查询并为用户提供的结果。较少的元搜索可以让您深入到最大,最有用的搜索引擎数据库。他们往往返回最小或免费的搜索引擎和其他免费目录并且通常是小和高度商业化的结果。

爬行技术

A:主题爬行一个通用的网络爬虫根据一个URL的特点设置来收集网页。凡为主题爬虫的设计有一个特定的主题的文件,从而减少了网络流量和下载量。主题爬虫的目标是有选择地寻找相关的网页的主题进行预先定义的设置。指定的主题不使用关键字,但使用示范文件。不是所有的收集和索引访问的Web文件能够回答所有可能的特殊查询,有一个主题爬虫爬行分析其抓起边界,找到链接,很可能是最适合抓取相关,并避免不相关的区域的Web。这导致在硬件和网络资源极大地节省,并有助于于保持在最新状态的数据。主题爬虫有三个主要组成部分一个分类器,这能够判断相关网页,决定抓取链接的拓展,过滤器决定过滤器抓取的网页,以确定优先访问中心次序的措施,以及均受量词和过滤器动态重新配置的优先的控制的爬虫。最关键的评价是衡量主题爬行收获的比例,这是在抓取过程中有多少比例相关网页被采用和不相干的网页是有效地过滤掉,这收获率最高,否则主题爬虫会花很多时间在消除不相关的网页,而且使用一个普通的爬虫可能会

更好。

B:分布式检索检索网络是一个挑战,因为它的成长性和动态性。随着网络规模越来越大,已经称为必须并行处理检索程序,以完成在合理的时间内下载网页。一个单一的检索程序,即使在是用多线程在大型引擎需要获取大量数据的快速上也存在不足。当一个爬虫通过一个单一的物理链接被所有被提取的数据所使用,通过分配多种抓取活动的进程可以帮助建立一个可扩展的易于配置的系统,它具有容错性的系统。拆分负载降低硬件要求,并在同一时间增加整体下载速度和可靠性。每个任务都是在一个完全分布式的方式,也就是说,没有中央协调器的存在。

挑战更多“有趣”对象的问题

搜索引擎被认为是一个热门话题,因为它收集用户查询记录。检索程序优先抓取网站根据一些重要的度量,例如相似性(对有引导的查询),返回链接数网页排名或者其他组合/变化最精Najork等。表明,首先考虑广泛优先搜索收集高品质页面,并提出一种网页排名。然而,目前,搜索策略是无法准确选择“最佳”路径,因为他们的认识仅仅是局部的。由于在互联网上可得到的信息数量非常庞大目前不可能实现全面的索引。因此,必须采用剪裁策略。主题爬行和智能检索,是发现相关的特定主题或主题集网页技术。

结论在本文中,我们得出这样的结论实现完整的网络爬行覆盖是不可能实现,因为受限于整个万维网的巨大规模和资源的可用性。通常是通过一种阈值的设置(网站访问人数,网站上树的水平,与主题等规定),以限制对选定的网站上进行抓取的过程。此信息是在搜索引擎可用于存储/刷新最相关和最新更新的网页,从而提高检索的内容质量,同时减少陈旧的内容和缺页。

外文译文原文:

Discussion on Web Crawlers of Search EngineAbstract-With the precipitous expansion of the Web,extracting knowledge from the Web is becoming gradually im portant and popular.This is due to the Web’s convenience and richness of information.To find Web pages, one typically uses search engines that are based on the Web crawling framework.This paper describes the basic task performed search engine.Overview of how the Web crawlers are related with search engine Keywords Distributed Crawling, Focused Crawling,Web Crawlers

Ⅰ.INTRODUCTION on the Web is a service that resides on computers that are connected to the Internet and allows end users to access data that is stored on the computers using standard interface software. The World Wide Web is the universe of network-accessible information,an embodiment of human knowledge Search engine is a computer program that searches for particular keywords and returns a list of documents in which they were found,especially a commercial service that scans documents on the Internet. A search engine finds information for its database by accepting listings sent it by authors who want exposure,or by getting the information from their “Web crawlers,””spiders,” or “robots,”programs that roam the Internet storing links to and information about each page they visit Web Crawler is a program, which fetches information from the World Wide Web in an automated manner.Web

crawling is an important research issue. Crawlers are software components, which visit portions of Web trees, according to certain strategies,and collect retrieved objects in local repositories The rest of the paper is organized as: in Section 2 we explain the background details of Web crawlers.In Section 3 we discuss on types of crawler, in Section 4 we will explain the working of Web crawler. In Section 5 we cover the two advanced techniques of Web crawlers. In the Section 6 we discuss the problem of selecting more interesting pages.

Ⅱ.SURVEY OF WEB CRAWLERSWeb crawlers are almost as old as the Web itself.The first crawler,Matthew Gray’s Wanderer, was written in the spring of 1993,roughly coinciding with the first release Mosaic.Several papers about Web crawling were presented at the first two World Wide Web conference.However,at the time, the Web was three to four orders of magnitude smaller than it is today,so those systems did not address the scaling problems inherent in a crawl of today’s Web Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the Web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described There are two notable exceptions:the Goole crawler and the Internet Archive crawler.Unfortunately,the descriptions of these crawlers in the literature are too terse to enable reproducibility The original Google crawler developed at Stanford consisted of five

functional components running in different processes. A URL server process read URLs out of a file and forwarded them to multiple crawler processes.Each crawler process ran on a different machine,was single-threaded,and used asynchronous I/O to fetch data from up to 300 Web servers in parallel. The crawlers transmitted downloaded pages to a single Store Server process, which compressed the pages and stored them to disk.The page were then read back from disk by an indexer process, which extracted links from HTML pages and saved them to a different disk file.

A URLs resolver process read the link file, relative the URLs contained there in, and saved the absolute URLs to the disk file that was read by the URL server. Typically,three to four crawler machines were used, so the entire system required between four and eight machines Research on Web crawling continues at Stanford even after Google has been transformed into a commercial effort.The Stanford Web Base project has implemented a high performance distributed crawler,capable of downloading 50 to 100 documents per second.Cho and others have also developed models of documents update frequencies to inform the download schedule of incremental crawlers The Internet Archive also used multiple machines to crawl the Web.Each crawler process was assigned up to 64 sites to crawl, and no site was assigned to more than one crawler.Each single-threaded crawler process read a list of seed URLs for its assigned sited from disk int per-site queues,and then used asynchronous I/O to fetch pages from

these queues in parallel. Once a page was downloaded, the crawler extracted the links contained in it.If a link referred to the site of the page it was contained in, it was added to the appropriate site queue;otherwise it was logged to disk .Periodically, a batch process merged these logged “cross-sit” URLs into the site--specific seed sets, filtering out duplicates in the process.

The Web Fountain crawler shares several of Mercator’s characteristics:it is distributed,continuousthe authors use the term”incremental”,polite, and configurable.Unfortunately,as of this writing,Web Fountain is in the early stages of its development, and data about its performance is not yet available.

Ⅲ.BASIC TYPESS OF SEARCH ENGINE

Crawler Based Search Engines

Crawler based search engines create their listings automaticallyputer programs ‘spider’ build them not by human selection. They are not organized by subject categories; a computer algorithm ranks all pages. Such kinds of search engines are huge and often retrieve a lot of information -- for complex searches it allows to search within the results of a previous search and enables you to refine search results. These types of search engines contain full text of the Web pages they link to .So one cann find pages by matching words in the pages one wants;

B. Human Powered Directories

These are built by human selection i.e. They depend on humans to create listings. They are organized into subject categories and subjects do classification of pages.Human powered directories never contain full text of the Web page they link to .They are smaller than most search engines.

C.Hybrid Search Engine

A hybrid search engine differs from traditional text oriented search engine such as Google or a directory-based search engine such as Yahoo in which each program operates by comparing a set of meta data, the primary corpus being the meta data derived from a Web crawler or taxonomic analysis of all internet text,and a user search query.In contrast, hybrid search engine may use these two bodies of meta data in addition to one or more sets of meta data that can, for example, include situational meta data derived from the client’s network that would model the context awareness of the client.

Ⅳ.WORKING OF A WEB CRAWLER

Web crawlers are an essential component to search engines;running a Web crawler is a challenging task.There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of Web servers and various name servers, which are all beyond the control of the system. Web crawling speed is governed not only by the speed of one’s own Internet connection ,but also by the

speed of the sites that are to be crawled.Especially if one is a crawling site from multiple servers, the total crawling time can be significantly reduced,if many downloads are done in parallel.

Despite the numerous applications for Web crawlers,at the core they are all fundamentally the same. Following is the process by which Web crawlers work:

Download the Web page.

Parse through the downloaded page and retrieve all the links.

For each link retrieved,repeat the process The Web crawler can be used for crawling through a whole site on the Inter-/Intranet You specify a start-URL and the Crawler follows all links found in that HTML page.This usually leads to more links,which will be followed again, and so on.A site can be seen as a tree-structure,the root is the start-URL;all links in that root-HTML-page are direct sons of the root. Subsequent links are then sons of the previous sons A single URL Server serves lists of URLs to a number of crawlers.Web crawler starts by parsing a specified Web page,noting any hypertext links on that page that point to other Web pages.They then parse those pages for new links,and so on,recursively.Web Crawler software doesn’t actually move around to different computers on the Internet, as viruses or intelligent agents do. Each crawler keeps roughly 300 connections open at once.This is necessary to retrieve Web page at a fast enough pace. A crawler resides on a single machine. The

crawler simply sends HTTP requests for documents to other machines on the Internet, just as a Web browser does when the user clicks on links. All the crawler really does is to automate the process of following links.

Web crawling can be regarded as processing items in a queue. When the crawler visits a Web page,it extracts links to other Web pages.So the crawler puts these URLs at the end of a queue, and continues crawling to a URL that it removes from the front of the queue.

Resource ConstraintsCrawlers consume resources: network bandwidth to download pages,memory to maintain private data structures in support of their algorithms,CUP to evaluate and select URLs,and disk storage to store the text and links of fetched pages as well as other persistent data.

B.Robot ProtocolThe robot.txt file gives directives for excluding a portion of a Web site to be crawled. Analogously,a simple text file can furnish information about the freshness and popularity fo published objects.This information permits a crawler to optimize its strategy for refreshing collected data as well as replacing object policy.

C.Meta Search EngineA meta-search engine is the kind of search engine that does not have its own database of Web pages.It sends search terms to the databases maintained by other search engines and gives users the result that come from all the search engines queried. Fewer meta searchers allow you to delve into the largest, most useful search engine databases. They tend to return results from smaller add/or search engines and

miscellaneous free directories, often small and highly commercial Ⅴ.CRAWLING TECHNIQUES

Focused CrawlingA general purpose Web crawler gathers as many pages as it can from a particular set of URL’s.Where as a focused crawler is designed to only gather documents on a specific topic,thus reducing the amount of network traffic and downloads.The goal of the focused crawler is to selectively seek out pages that are relevant to a predefined set of topics.The topics re specified not using keywords,but using exemplary documents Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl,and avoids irrelevant regions of the Web This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. The focused crawler has three main components;: a classifier which makes relevance judgments on pages,crawled to decide on link expansion,a distiller which determines a measure of centrality of crawled pages to determine visit priorities, and a crawler with dynamically reconfigurable priority controls which is governed by the classifier and distiller The most crucial evaluation of focused crawling is to measure the harvest ratio, which is rate at which relevant pages are acquired and irrelevant pages are effectively filtered off from the crawl.This harvest ratio must be high ,otherwise the focused

crawler would spend a lot of time merely eliminating irrelevant pages, and it may be better to use an ordinary crawler instead.

B.Distributed CrawlingIndexing the Web is a challenge due to its growing and dynamic nature.As the size of the Web sis growing it has become imperative to parallelize the crawling process in order to finish downloading the pages in a reasonable amount of time.A single crawling process even if multithreading is used will be insufficient for large - scale engines that need to fetch large amounts of data rapidly.When a single centralized crawler is used all the fetched data passes through a single physical link.Distributing the crawling activity via multiple processes can help build a scalable, easily configurable system,which is fault tolerant system.Splitting the load decreases hardware requirements and at the same time increases the overall download speed and reliability. Each task is performed in a fully distributed fashion,that is ,no central coordinator exits.

Ⅵ.PROBLEM OF SELECTING MORE “INTERESTING”A search engine is aware of hot topics because it collects user queries.The crawling process prioritizes URLs according to an importance metric such as similarityto a driving query,back-link count,Page Rank or their combinations/variations.Recently Najork et al. Showed that breadth-first search collects high-quality pages first and suggested a variant of Page Rank.However,at the moment,search strategies are unable to exactly select

the “best” paths because their knowledge is only partial.Due to the

enormous amount of information available on the Internet a total-crawling

is at the moment impossible,thus,prune strategies must be applied.Focused

crawling and intelligent crawling,are techniques for discovering Web

pages relevant to a specific topic or set of topics.

CONCLUSIONIn this paper we conclude that complete web crawling

coverage cannot be achieved, due to the vast size of the whole and to

resource https://www.360docs.net/doc/3a11421492.html,ually a kind of threshold is set upnumber of visited URLs, level in the website tree,compliance with a t

毕业论文外文翻译模版

吉林化工学院理学院 毕业论文外文翻译English Title(Times New Roman ,三号) 学生学号:08810219 学生姓名:袁庚文 专业班级:信息与计算科学0802 指导教师:赵瑛 职称副教授 起止日期:2012.2.27~2012.3.14 吉林化工学院 Jilin Institute of Chemical Technology

1 外文翻译的基本内容 应选择与本课题密切相关的外文文献(学术期刊网上的),译成中文,与原文装订在一起并独立成册。在毕业答辩前,同论文一起上交。译文字数不应少于3000个汉字。 2 书写规范 2.1 外文翻译的正文格式 正文版心设置为:上边距:3.5厘米,下边距:2.5厘米,左边距:3.5厘米,右边距:2厘米,页眉:2.5厘米,页脚:2厘米。 中文部分正文选用模板中的样式所定义的“正文”,每段落首行缩进2字;或者手动设置成每段落首行缩进2字,字体:宋体,字号:小四,行距:多倍行距1.3,间距:前段、后段均为0行。 这部分工作模板中已经自动设置为缺省值。 2.2标题格式 特别注意:各级标题的具体形式可参照外文原文确定。 1.第一级标题(如:第1章绪论)选用模板中的样式所定义的“标题1”,居左;或者手动设置成字体:黑体,居左,字号:三号,1.5倍行距,段后11磅,段前为11磅。 2.第二级标题(如:1.2 摘要与关键词)选用模板中的样式所定义的“标题2”,居左;或者手动设置成字体:黑体,居左,字号:四号,1.5倍行距,段后为0,段前0.5行。 3.第三级标题(如:1.2.1 摘要)选用模板中的样式所定义的“标题3”,居左;或者手动设置成字体:黑体,居左,字号:小四,1.5倍行距,段后为0,段前0.5行。 标题和后面文字之间空一格(半角)。 3 图表及公式等的格式说明 图表、公式、参考文献等的格式详见《吉林化工学院本科学生毕业设计说明书(论文)撰写规范及标准模版》中相关的说明。

外文翻译--农村金融主流的非正规金融机构

附录1 RURAL FINANCE: MAINSTREAMING INFORMAL FINANCIAL INSTITUTIONS By Hans Dieter Seibel Abstract Informal financial institutions (IFIs), among them the ubiquitous rotating savings and credit associations, are of ancient origin. Owned and self-managed by local people, poor and non-poor, they are self-help organizations which mobilize their own resources, cover their costs and finance their growth from their profits. With the expansion of the money economy, they have spread into new areas and grown in numbers, size and diversity; but ultimately, most have remained restricted in size, outreach and duration. Are they best left alone, or should they be helped to upgrade their operations and be integrated into the wider financial market? Under conducive policy conditions, some have spontaneously taken the opportunity of evolving into semiformal or formal microfinance institutions (MFIs). This has usually yielded great benefits in terms of financial deepening, sustainability and outreach. Donors may build on these indigenous foundations and provide support for various options of institutional development, among them: incentives-driven mainstreaming through networking; encouraging the establishment of new IFIs in areas devoid of financial services; linking IFIs/MFIs to banks; strengthening Non-Governmental Organizations (NGOs) as promoters of good practices; and, in a nonrepressive policy environment, promoting appropriate legal forms, prudential regulation and delegated supervision. Key words: Microfinance, microcredit, microsavings。 1. informal finance, self-help groups In March 1967, on one of my first field trips in Liberia, I had the opportunity to observe a group of a dozen Mano peasants cutting trees in a field belonging to one of them. Before they started their work, they placed hoe-shaped masks in a small circle, chanted words and turned into animals. One turned into a lion, another one into a bush hog, and so on, and they continued to imitate those animals throughout the whole day, as they worked hard on their land. I realized I was onto something serious, and at the end of the day, when they had put the masks into a bag and changed back into humans,

毕设开题报告-及开题报告分析

开题报告如何写 注意点 1.一、对指导教师下达的课题任务的学习与理解 这部分主要是阐述做本课题的重要意义 2.二、阅读文献资料进行调研的综述 这部分就是对课题相关的研究的综述落脚于本课题解决了那些关键问题 3.三、根据任务书的任务及文件调研结果,初步拟定执行实施的方案(含具体进度计划) 这部分重点写具体实现的技术路线方案的具体实施方法和步骤了,具体进度计划只是附在后面的东西不是重点

南京邮电大学通达学院毕业设计(论文)开题报告

文献[5] 基于信息数据分析的微博研究综述[J];研究微博信息数据的分析,在这类研究中,大多数以微博消息传播的三大构件---微博消息、用户、用户关系为研究对象。以微博消息传播和微博成员组织为主要研究内容,目的在于发祥微博中用户、消息传博、热点话题、用户关系网络等的规律。基于微博信息数据分析的研究近年来在国内外都取得了很多成果,掌握了微博中的大量特征。该文献从微博消息传播三大构件的角度,对当前基于信息数据分析的微博研究进行系统梳理,提出微博信息传播三大构件的概念,归纳了此类研究的主要研究内容及方法。 对于大多用户提出的与主题或领域相关的查询需求,传统的通用搜索引擎往往不能提供令人满意的结果网页。为了克服通用搜索引擎的以上不足,提出了面向主题的聚焦爬虫的研究。文献[6]综述了聚焦爬虫技术的研究。其中介绍并分析了聚焦爬虫中的关键技术:抓取目标定义与描述,网页分析算法和网页分析策略,并根据网络拓扑、网页数据内容、用户行为等方面将各种网页分析算法做了分类和比较。聚焦爬虫能够克服通用爬虫的不足之处。 文献[7]首先介绍了网络爬虫工作原理,传统网络爬虫的实现过程,并对网络爬虫中使用的关键技术进行了研究,包括网页搜索策略、URL去重算法、网页分析技术、更新策略等。然后针对微博的特点和Ajax技术的实现方法,指出传统网络爬虫的不足,以及信息抓取的技术难点,深入分析了现有的基于Ajax的网络爬虫的最新技术——通过模拟浏览器行为,触发JavaScript事件(如click, onmouseo ver等),解析JavaScript脚本,动态更新网页DOM树,抽取网页中的有效信息。最后,详细论述了面向SNS网络爬虫系统的设计方案,整体构架,以及各功能模块的具体实现。面向微博的网络爬虫系统的实现是以新浪微博作为抓取的目标网站。结合新浪微博网页的特点,通过模拟用户行为,解析JavaSc ript,建立DOM树来获取网页动态信息,并按照一定的规则提取出网页中的URL和有效信息,并将有效信息存入数据库。本系统成功的实现了基于Ajax技术的网页信息的提取。 文献[8]引入网页页面分析技术和主题相关性分析技术,解决各大网站微博相继提供了抓取微博的API,这些API都有访问次数的限制,无法满足获取大量微博数据的要求,同时抓取的数据往往很杂乱的问题。展开基于主题的微博网页爬虫的研究与设计。本文的主要工作有研究分析网页页面分析技术,根据微博页面特点选择微博页面信息获取方法;重点描述基于“剪枝”的广度优先搜索策略的思考以及设计的详细过程,着重解决URL的去重、URL地址集合动态变化等问题;研究分析短文本主题抽取技术以及多关键匹配技术,确定微博主题相关性分析的设计方案;最后设计实现基于主题的微博网页爬虫的原型系统,实时抓取和存储微博数据。本文研究的核心问题是,根据微博数据的特点设计一种基于“剪枝”的广度优先搜索策略,并将其应用到微博爬虫中;同时使用微博页面分析技术使得爬虫不受微博平台API限制,从而让用户尽可能准确地抓取主题相关的微博数据。通过多次反复实验获取原型系统实验结果,将实验结果同基于API微博爬虫和基于网页微博爬虫的抓取效果进行对比分析得出结论:本文提出的爬行策略能够抓取主题相关的微博数据,虽然在效率上有所降低,但在抓取的微博数据具有较好的主题相关性。这实验结果证明本论文研究的实现方案是可行的。 文献[9]阐述了基于ajax的web应用程序的爬虫和用户界面状态改变的动态分析的过程和思路。文献[10]对于全球社交网络Twitter,设计并实现了,一个爬虫系统,从另一个角度阐明了Python在编写爬虫这个方面的强大和快速。仅仅用少量的代码就能实现爬虫系统,并且再强大的社交网站也可

计算机网络安全文献综述

计算机网络安全综述学生姓名:李嘉伟 学号:11209080279 院系:信息工程学院指导教师姓名:夏峰二零一三年十月

[摘要] 随着计算机网络技术的快速发展,网络安全日益成为人们关注的焦点。本文分析了影响网络安全的主要因素及攻击的主要方式,从管理和技术两方面就加强计算机网络安全提出了针对性的建议。 [关键词] 计算机网络;安全;管理;技术;加密;防火墙 一.引言 计算机网络是一个开放和自由的空间,但公开化的网络平台为非法入侵者提供了可乘之机,黑客和反黑客、破坏和反破坏的斗争愈演愈烈,不仅影响了网络稳定运行和用户的正常使用,造成重大经济损失,而且还可能威胁到国家安全。如何更有效地保护重要的信息数据、提高计算机网络的安全性已经成为影响一个国家的政治、经济、军事和人民生活的重大关键问题。本文通过深入分析网络安全面临的挑战及攻击的主要方式,从管理和技术两方面就加强计算机网络安全提出针对性建议。

二.正文 1.影响网络安全的主要因素[1] 计算机网络安全是指“为数据处理系统建立和采取的技术和管理的安全保护,保护计算机硬件、软件数据不因偶然和恶意的原因而遭到破坏、更改和泄漏”。计算机网络所面临的威胁是多方面的,既包括对网络中信息的威胁,也包括对网络中设备的威胁,但归结起来,主要有三点:一是人为的无意失误。如操作员安全配置不当造成系统存在安全漏洞,用户安全意识不强,口令选择不慎,将自己的帐号随意转借他人或与别人共享等都会给网络安全带来威胁。二是人为的恶意攻击。这也是目前计算机网络所面临的最大威胁,比如敌手的攻击和计算机犯罪都属于这种情况,此类攻击又可以分为两种:一种是主动攻击,它以各种方式有选择地破坏信息的有效性和完整性;另一类是被动攻击,它是在不影响网络正常工作的情况下,进行截获、窃取、破译以获得重要机密信息。这两种攻击均可对计算机网络造成极大的危害,并导致机密数据的泄漏。三是网络软件的漏洞和“后门”。任何一款软件都或多或少存在漏洞,这些缺陷和漏洞恰恰就是黑客进行攻击的首选目标。绝大部分网络入侵事件都是因为安全措施不完善,没有及时补上系统漏洞造成的。此外,软件公司的编程人员为便于维护而设置的软件“后门”也是不容忽视的巨大威胁,一旦“后门”洞开,别人就能随意进入系统,后果不堪设想。

网络爬虫外文翻译

外文资料 ABSTRACT Crawling the web is deceptively simple: the basic algorithm is (a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon. 1. INTRODUCTION A recent Pew Foundation study [31] states that “Search eng ines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly

毕业论文 外文翻译#(精选.)

毕业论文(设计)外文翻译 题目:中国上市公司偏好股权融资:非制度性因素 系部名称:经济管理系专业班级:会计082班 学生姓名:任民学号: 200880444228 指导教师:冯银波教师职称:讲师 年月日

译文: 中国上市公司偏好股权融资:非制度性因素 国际商业管理杂志 2009.10 摘要:本文把重点集中于中国上市公司的融资活动,运用西方融资理论,从非制度性因素方面,如融资成本、企业资产类型和质量、盈利能力、行业因素、股权结构因素、财务管理水平和社会文化,分析了中国上市公司倾向于股权融资的原因,并得出结论,股权融资偏好是上市公司根据中国融资环境的一种合理的选择。最后,针对公司的股权融资偏好提出了一些简明的建议。 关键词:股权融资,非制度性因素,融资成本 一、前言 中国上市公司偏好于股权融资,根据中国证券报的数据显示,1997年上市公司在资本市场的融资金额为95.87亿美元,其中股票融资的比例是72.5%,,在1998年和1999年比例分别为72.6%和72.3%,另一方面,债券融资的比例分别是17.8%,24.9%和25.1%。在这三年,股票融资的比例,在比中国发达的资本市场中却在下跌。以美国为例,当美国企业需要的资金在资本市场上,于股权融资相比他们宁愿选择债券融资。统计数据显示,从1970年到1985年,美日企业债券融资占了境外融资的91.7%,比股权融资高很多。阎达五等发现,大约中国3/4的上市公司偏好于股权融资。许多研究的学者认为,上市公司按以下顺序进行外部融资:第一个是股票基金,第二个是可转换债券,三是短期债务,最后一个是长期负债。许多研究人员通常分析我国上市公司偏好股权是由于我们国家的经济改革所带来的制度性因素。他们认为,上市公司的融资活动违背了西方古典融资理论只是因为那些制度性原因。例如,优序融资理论认为,当企业需要资金时,他们首先应该转向内部资金(折旧和留存收益),然后再进行债权融资,最后的选择是股票融资。在这篇文章中,笔者认为,这是因为具体的金融环境激活了企业的这种偏好,并结合了非制度性因素和西方金融理论,尝试解释股权融资偏好的原因。

大学毕业论文---软件专业外文文献中英文翻译

软件专业毕业论文外文文献中英文翻译 Object landscapes and lifetimes Tech nically, OOP is just about abstract data typing, in herita nee, and polymorphism, but other issues can be at least as importa nt. The rema in der of this sect ion will cover these issues. One of the most importa nt factors is the way objects are created and destroyed. Where is the data for an object and how is the lifetime of the object con trolled? There are differe nt philosophies at work here. C++ takes the approach that con trol of efficie ncy is the most importa nt issue, so it gives the programmer a choice. For maximum run-time speed, the storage and lifetime can be determined while the program is being written, by placing the objects on the stack (these are sometimes called automatic or scoped variables) or in the static storage area. This places a priority on the speed of storage allocatio n and release, and con trol of these can be very valuable in some situati ons. However, you sacrifice flexibility because you must know the exact qua ntity, lifetime, and type of objects while you're writing the program. If you are trying to solve a more general problem such as computer-aided desig n, warehouse man ageme nt, or air-traffic con trol, this is too restrictive. The sec ond approach is to create objects dyn amically in a pool of memory called the heap. In this approach, you don't know un til run-time how many objects you n eed, what their lifetime is, or what their exact type is. Those are determined at the spur of the moment while the program is runnin g. If you n eed a new object, you simply make it on the heap at the point that you n eed it. Because the storage is man aged dyn amically, at run-time, the amount of time required to allocate storage on the heap is sig ni fica ntly Ion ger tha n the time to create storage on the stack. (Creat ing storage on the stack is ofte n a si ngle assembly in structio n to move the stack poin ter dow n, and ano ther to move it back up.) The dyn amic approach makes the gen erally logical assumpti on that objects tend to be complicated, so the extra overhead of finding storage and releas ing that storage will not have an importa nt impact on the creati on of an object .In additi on, the greater flexibility is esse ntial to solve the gen eral program ming problem. Java uses the sec ond approach, exclusive". Every time you want to create an object, you use the new keyword to build a dyn amic in sta nee of that object. There's ano ther issue, however, and that's the lifetime of an object. With Ian guages that allow objects to be created on the stack, the compiler determines how long the object lasts and can automatically destroy it. However, if you create it on the heap the compiler has no kno wledge of its lifetime. In a Ianguage like C++, you must determine programmatically when to destroy the

网络安全外文翻译文献

网络安全外文翻译文献 (文档含英文原文和中文翻译) 翻译: 计算机网络安全与防范 1.1引言 计算机技术的飞速发展提供了一定的技术保障,这意味着计算机应用已经渗透到社会的各个领域。在同一时间,巨大的进步和网络技术的普及,社会带来了巨大的经济利润。然而,在破坏和攻击计算机信息系统的方法已经改变了很多的网络环境下,网络安全问题逐渐成为计算机安全的主流。

1.2网络安全 1.2.1计算机网络安全的概念和特点 计算机网络的安全性被认为是一个综合性的课题,由不同的人,包括计算机科学、网络技术、通讯技术、信息安全技术、应用数学、信息理论组成。作为一个系统性的概念,网络的安全性由物理安全、软件安全、信息安全和流通安全组成。从本质上讲,网络安全是指互联网信息安全。一般来说,安全性、集成性、可用性、可控性是关系到网络信息的相关理论和技术,属于计算机网络安全的研究领域。相反,狭隘“网络信息安全”是指网络安全,这是指保护信息秘密和集成,使用窃听、伪装、欺骗和篡夺系统的安全性漏洞等手段,避免非法活动的相关信息的安全性。总之,我们可以保护用户利益和验证用户的隐私。 计算机网络安全有保密性、完整性、真实性、可靠性、可用性、非抵赖性和可控性的特点。 隐私是指网络信息不会被泄露给非授权用户、实体或程序,但是授权的用户除外,例如,电子邮件仅仅是由收件人打开,其他任何人都不允许私自这样做。隐私通过网络信息传输时,需要得到安全保证。积极的解决方案可能会加密管理信息。虽然可以拦截,但它只是没有任何重要意义的乱码。 完整性是指网络信息可以保持不被修改、破坏,并在存储和传输过程中丢失。诚信保证网络的真实性,这意味着如果信息是由第三方或未经授权的人检查,内容仍然是真实的和没有被改变的。因此保持完整性是信息安全的基本要求。 可靠性信息的真实性主要是确认信息所有者和发件人的身份。 可靠性表明该系统能够在规定的时间和条件下完成相关的功能。这是所有的网络信息系统的建立和运作的基本目标。 可用性表明网络信息可被授权实体访问,并根据自己的需求使用。 不可抵赖性要求所有参加者不能否认或推翻成品的操作和在信息传输过程中的承诺。

毕业论文外文翻译模板

农村社会养老保险的现状、问题与对策研究社会保障对国家安定和经济发展具有重要作用,“城乡二元经济”现象日益凸现,农村社会保障问题客观上成为社会保障体系中极为重要的部分。建立和完善农村社会保障制度关系到农村乃至整个社会的经济发展,并且对我国和谐社会的构建至关重要。我国农村社会保障制度尚不完善,因此有必要加强对农村独立社会保障制度的构建,尤其对农村养老制度的改革,建立健全我国社会保障体系。从户籍制度上看,我国居民养老问题可分为城市居民养老和农村居民养老两部分。对于城市居民我国政府已有比较充足的政策与资金投人,使他们在物质和精神方面都能得到较好地照顾,基本实现了社会化养老。而农村居民的养老问题却日益突出,成为摆在我国政府面前的一个紧迫而又棘手的问题。 一、我国农村社会养老保险的现状 关于农村养老,许多地区还没有建立农村社会养老体系,已建立的地区也存在很多缺陷,运行中出现了很多问题,所以完善农村社会养老保险体系的必要性与紧迫性日益体现出来。 (一)人口老龄化加快 随着城市化步伐的加快和农村劳动力的输出,越来越多的农村青壮年人口进入城市,年龄结构出现“两头大,中间小”的局面。中国农村进入老龄社会的步伐日渐加快。第五次人口普查显示:中国65岁以上的人中农村为5938万,占老龄总人口的67.4%.在这种严峻的现实面前,农村社会养老保险的徘徊显得极其不协调。 (二)农村社会养老保险覆盖面太小 中国拥有世界上数量最多的老年人口,且大多在农村。据统计,未纳入社会保障的农村人口还很多,截止2000年底,全国7400多万农村居民参加了保险,占全部农村居民的11.18%,占成年农村居民的11.59%.另外,据国家统计局统计,我国进城务工者已从改革开放之初的不到200万人增加到2003年的1.14亿人。而基本方案中没有体现出对留在农村的农民和进城务工的农民给予区别对待。进城务工的农民既没被纳入到农村养老保险体系中,也没被纳入到城市养老保险体系中,处于法律保护的空白地带。所以很有必要考虑这个特殊群体的养老保险问题。

农村金融小额信贷中英文对照外文翻译文献

农村金融小额信贷中英文对照外文翻译文献(文档含英文原文和中文翻译) RURAL FINANCE: MAINSTREAMING INFORMAL FINANCIAL INSTITUTIONS By Hans Dieter Seibel Abstract Informal financial institutions (IFIs), among them the ubiquitous rotating savings and credit associations, are of ancient origin. Owned and self-managed by local people, poor and non-poor, they are self-help organizations which mobilize their own resources, cover their costs and finance their growth from their profits. With the expansion of the money economy, they have spread into new areas and grown in numbers, size and diversity; but ultimately, most have remained restricted in size, outreach and duration. Are they best left alone, or should they be helped to upgrade

简析网络语言的文献综述

浅析网络语言的文献综述 摘要 语言是一种文化,一个民族要有文化前途,靠的是创新。从这个意义上说,新词语用过了些并不可怕,如果语言僵化,词汇贫乏,那才是真正的可悲。语汇系统如果只有基本词,永远稳稳当当,语言就没有生命力可言,因此,在规定一定的规范的同时,要允许歧疑的存在,但更要积极吸收那些脱离当时的规范而能促进语言的丰富和发展的成分。正确看待网络语言。 关键字 网络语言;因素;发展趋势; 一、关于“网络语言”涵义及现状的研究 1.网络语言的涵义研究 网络语言是一个有着多种理解的概念,既可以指称网络特有的言语表达方式,也可以指网络中使用的自然语言,还可以把网络中使用的所有符号全部包括在内。网络语言起初多指网络语言的研究现状(网络的计算机语言,又指网络上使用的有自己特点的自然语言。于根元,2001)。 较早开展网络语言研究的劲松、麒可(2000)认为,广义的网络语言是与网络时代、e时代出现的与网络和电子技术有关的“另类语言”;狭义的网络语言指自称网民、特称网虫的语言。 周洪波(2001)则认为,网络语言是指人们在网络交流中所使用的语言形式,大体上可分为三类:一是与网络有关的专业术语;二是与网络有关的特别用语;三是网民在聊天室和BBS上的常用词语。 于根元(2003)指出,“网络语言”本身也是一个网络用语。起初多指网络的计算机语言,也指网络上使用的有自己特点的自然语言。现在一般指后者。狭义的网络语言指论坛和聊天室的具有特点的用语。 何洪峰(2003)进一步指出,网络语言是指媒体所使用的语言,其基本词汇及语法结构形式还是全民使用的现代汉语,这是它的主体形式;二是指IT领域的专业用语,或是指与电子计算机联网或网络活动相关的名词术语;其三,狭义上是指网民所创造的一些特殊的信息符号。总的看来,研究者基本认为网络语言有广义、狭义两种含义,广义的网络语言主要指与网络有关的专业术语,狭义的网络语言主要指在聊天室和BBS上常用的词语和符号。 2. 网络语言的研究现状 如:国人大常委会委员原国家教委副主任柳斌表示,网络语言的混乱,是对汉语纯洁性的破坏,语言文字工作者应对此类现象加以引导和批评。国家网络工程委会副秘书史自文表示,老师要引导学生使用网络语言。比如说在写出作文的时候,可以针对彩简单的网络语言还是用含义更有韵味的唐诗更好做一个主题研讨会,和学生一起探讨。这样就可以在理解、尊重学生的基础上进行引导。经过这样的过程,学生对于用何种语言形式多了一个选择,又加深了对传统文化的理解。 如:北京教科院基教所研究员王晓春表示,在网络世界里用网络语言无可厚非。但在正式场合要引导学生不使用网络语言。在教学中老师要引导学生如何正

电子信息工程专业毕业论文外文翻译中英文对照翻译

本科毕业设计(论文)中英文对照翻译 院(系部)电气工程与自动化 专业名称电子信息工程 年级班级 04级7班 学生姓名 指导老师

Infrared Remote Control System Abstract Red outside data correspondence the technique be currently within the scope of world drive extensive usage of a kind of wireless conjunction technique,drive numerous hardware and software platform support. Red outside the transceiver product have cost low, small scaled turn, the baud rate be quick, point to point SSL, be free from electromagnetism thousand Raos etc.characteristics, can realization information at dissimilarity of the product fast, convenience, safely exchange and transmission, at short distance wireless deliver aspect to own very obvious of advantage.Along with red outside the data deliver a technique more and more mature, the cost descend, red outside the transceiver necessarily will get at the short distance communication realm more extensive of application. The purpose that design this system is transmit cu stomer’s operation information with infrared rays for transmit media, then demodulate original signal with receive circuit. It use coding chip to modulate signal and use decoding chip to demodulate signal. The coding chip is PT2262 and decoding chip is PT2272. Both chips are made in Taiwan. Main work principle is that we provide to input the information for the PT2262 with coding keyboard. The input information was coded by PT2262 and loading to high frequent load wave whose frequent is 38 kHz, then modulate infrared transmit dioxide and radiate space outside when it attian enough power. The receive circuit receive the signal and demodulate original information. The original signal was decoded by PT2272, so as to drive some circuit to accomplish

网络安全中的中英对照

网络安全中的中英对照 Access Control List(ACL)访问控制列表 access token 访问令牌 account lockout 帐号封锁 account policies 记帐策略 accounts 帐号 adapter 适配器 adaptive speed leveling 自适应速率等级调整 Address Resolution Protocol(ARP) 地址解析协议Administrator account 管理员帐号 ARPANET 阿帕网(internet的前身) algorithm 算法 alias 别名 allocation 分配、定位 alias 小应用程序 allocation layer 应用层 API 应用程序编程接口 anlpasswd 一种与Passwd+相似的代理密码检查器 applications 应用程序 ATM 异步传递模式

audio policy 审记策略 auditing 审记、监察 back-end 后端 borde 边界 borde gateway 边界网关 breakabie 可破密的 breach 攻破、违反 cipher 密码 ciphertext 密文 CAlass A domain A类域 CAlass B domain B类域 CAlass C domain C类域 classless addressing 无类地址分配 cleartext 明文 CSNW Netware客户服务 client 客户,客户机 client/server 客户机/服务器 code 代码 COM port COM口(通信端口) CIX 服务提供者 computer name 计算机名

搜索引擎爬虫外文翻译文献

搜索引擎爬虫外文翻译文献 (文档含中英文对照即英文原文和中文翻译) 译文: 探索搜索引擎爬虫 随着网络难以想象的急剧扩张,从Web中提取知识逐渐成为一种受欢迎的途径。这是由于网络的便利和丰富的信息。通常需要使用基于网络爬行的搜索引擎来找到我们需要的网页。本文描述了搜索引擎的基本工作任务。概述了搜索引擎与网络爬虫之间的联系。 关键词:爬行,集中爬行,网络爬虫 1.导言 在网络上WWW是一种服务,驻留在链接到互联网的电脑上,并允许最终用户访问是用标准的接口软件的计算机中的存储数据。万维网是获取访问网络信息的宇

宙,是人类知识的体现。 搜索引擎是一个计算机程序,它能够从网上搜索并扫描特定的关键字,尤其是商业服务,返回的它们发现的资料清单,抓取搜索引擎数据库的信息主要通过接收想要发表自己作品的作家的清单或者通过“网络爬虫”、“蜘蛛”或“机器人”漫游互联网捕捉他们访问过的页面的相关链接和信息。 网络爬虫是一个能够自动获取万维网的信息程序。网页检索是一个重要的研究课题。爬虫是软件组件,它访问网络中的树结构,按照一定的策略,搜索并收集当地库中检索对象。 本文的其余部分组织如下:第二节中,我们解释了Web爬虫背景细节。在第3节中,我们讨论爬虫的类型,在第4节中我们将介绍网络爬虫的工作原理。在第5节,我们搭建两个网络爬虫的先进技术。在第6节我们讨论如何挑选更有趣的问题。 2.调查网络爬虫 网络爬虫几乎同网络本身一样古老。第一个网络爬虫,马修格雷浏览者,写于1993年春天,大约正好与首次发布的OCSA Mosaic网络同时发布。在最初的两次万维网会议上发表了许多关于网络爬虫的文章。然而,在当时,网络i现在要小到三到四个数量级,所以这些系统没有处理好当今网络中一次爬网固有的缩放问题。 显然,所有常用的搜索引擎使用的爬网程序必须扩展到网络的实质性部分。但是,由于搜索引擎是一项竞争性质的业务,这些抓取的设计并没有公开描述。有两个明显的例外:股沟履带式和网络档案履带式。不幸的是,说明这些文献中的爬虫程序是太简洁以至于能够进行重复。 原谷歌爬虫(在斯坦福大学开发的)组件包括五个功能不同的运行流程。服务器进程读取一个URL出来然后通过履带式转发到多个进程。每个履带进程运行在不同的机器,是单线程的,使用异步I/O采用并行的模式从最多300个网站来抓取数据。爬虫传输下载的页面到一个能进行网页压缩和存储的存储服务器进程。然后这些页面由一个索引进程进行解读,从HTML页面中提取链接并将他们保存到不同的磁盘文件中。一个URL解析器进程读取链接文件,并将相对的网址进行存储,并保存了完整的URL到磁盘文件然后就可以进行读取了。通常情况下,因

相关文档
最新文档