Computer Science Foreign-Language Translation --- Efficient URL Caching for Web Crawling

Original Foreign-Language Text

Efficient URL Caching for World Wide Web Crawling

Andrei Z. Broder

IBM T. J. Watson Research Center

19 Skyline Dr

Hawthorne, NY 10532

abroder@

Marc Najork

Microsoft Research

1065 La Avenida

Mountain View, CA 94043

najork@

Janet L. Wiener

Hewlett Packard Labs

1501 Page Mill Road

Palo Alto, CA 94304

janet.wiener@

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page, (b) Parse it to extract all linked URLs, (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
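As a rough check on these figures, the following back-of-the-envelope estimate is consistent with the abstract; the roughly one-month refresh cycle and the figure of about ten extracted links per page are assumptions added here for illustration, not numbers stated in the paper.

```latex
% Assuming a full refresh roughly every month and on the order of
% ten extracted links per fetched page:
\frac{4 \times 10^{9}\ \text{pages}}{30 \times 86{,}400\ \text{s}}
  \approx 1{,}500\ \text{fetches/s},
\qquad
1{,}500\ \text{fetches/s} \times \sim 10\ \text{links/page}
  > 10^{4}\ \text{membership tests/s}.
```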

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
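To make the idea concrete, below is a minimal sketch of an LRU cache for the seen-URL test, one of the practical policies compared in the paper. The class name, capacity, and interface are illustrative assumptions, not the authors' implementation, and the slow backing store that holds the full set of seen URLs is omitted.

```python
# A minimal LRU cache sketch for the "seen URL" membership test.
# A hit means the URL was recently encountered, so the expensive lookup
# against the full (disk-resident or distributed) seen-set can be skipped.
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity=50_000):
        self.capacity = capacity
        self._entries = OrderedDict()  # keys are URLs (or URL fingerprints)

    def contains(self, url):
        """Return True on a cache hit and refresh the entry's recency."""
        if url in self._entries:
            self._entries.move_to_end(url)  # mark as most recently used
            return True
        return False

    def add(self, url):
        """Insert a URL, evicting the least recently used entry when full."""
        self._entries[url] = True
        self._entries.move_to_end(url)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the LRU entry
```

CLOCK, also listed among the practical algorithms, is commonly described as a lower-overhead approximation of this recency-based eviction.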

1. INTRODUCTION

A recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is

(a) Fetch a page

(b) Parse it to extract all linked URLs

(c) For all the URLs not seen before, repeat (a)–(c)
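A minimal, single-threaded sketch of steps (a)–(c) in Python is given below. The helper class, the frontier queue, and the page limit are illustrative assumptions; a production crawler would add robots.txt handling, politeness delays, and the distributed seen-URL test discussed in the abstract.

```python
# A minimal sketch of the basic crawl loop, steps (a)-(c), with an
# in-memory "seen" set standing in for the real membership test.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links and drop #fragments.
                    self.links.append(urldefrag(urljoin(self.base_url, value))[0])


def crawl(seed_urls, max_pages=100):
    seen = set(seed_urls)          # the set probed by the membership test (c)
    frontier = deque(seed_urls)    # URLs waiting to be fetched
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")  # (a)
        except Exception:
            continue               # skip unreachable or undecodable pages
        max_pages -= 1
        parser = LinkExtractor(url)
        parser.feed(html)          # (b) extract linked URLs
        for link in parser.links:
            if link not in seen:   # (c) membership test
                seen.add(link)
                frontier.append(link)
```

For example, `crawl(["https://example.com/"], max_pages=10)` would fetch up to ten pages reachable from the seed.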

Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as , but in this case a relatively large portion of the
