Search Engine Crawlers: Translated Foreign-Language Literature

Search Engine Design and Implementation: Translated Foreign-Language Literature

(The document contains the English original and a Chinese translation.) Original text: The Hadoop Distributed File System: Architecture and Design

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project.

Assumptions and Goals

Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications targeted for HDFS. POSIX semantics in a few key areas have been traded to increase data throughput rates.

Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files.
It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending writes to files in the future.

"Moving Computation is Cheaper than Moving Data"

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Use of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode.
An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation of the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches.
In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks. The NameNode determines the rack id each DataNode belongs to via the process outlined in Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster, which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.

For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not distribute evenly across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

The current, default replica placement policy described here is a work in progress.

Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request.
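The default three-replica policy described above (one replica on the writer's node, a second on a different node in the same rack, a third on a node in another rack) can be sketched as a small placement function. This is an illustrative model only; the topology mapping and all names are invented, not the actual HDFS implementation.

```python
import random

def place_replicas(writer_node, topology):
    """Pick three DataNodes for a block under the rack-aware policy above.
    `topology` maps rack id -> list of node names (an invented structure)."""
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    chosen = [writer_node]  # first replica: on the writer's own node
    # Second replica: a different node in the same (local) rack.
    chosen.append(random.choice([n for n in topology[local_rack] if n != writer_node]))
    # Third replica: any node in a different rack.
    remote_rack = random.choice([r for r in topology if r != local_rack])
    chosen.append(random.choice(topology[remote_rack]))
    return chosen
```

Note how a block lands on only two distinct racks, which is exactly the inter-rack write-traffic saving the text describes.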
If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.

Safemode

On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur while the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.

The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.

The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories.
When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes this new version out into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint occurs only when the NameNode starts up. Work is in progress to support periodic checkpointing in the near future.

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode: this is the Blockreport.

The Communication Protocols

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.

Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures.
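The startup checkpoint described earlier in this section (read the FsImage, replay the EditLog transactions onto the in-memory namespace, flush a new FsImage, truncate the log) can be modeled in a few lines. This is a toy sketch; the record format and operation names are invented, not HDFS's actual on-disk formats.

```python
def checkpoint(fsimage, editlog):
    """Replay every EditLog transaction onto a copy of the FsImage, then
    return the new image and an emptied (truncated) log."""
    image = dict(fsimage)  # in-memory representation of the namespace
    for op, path, value in editlog:
        if op == "create":
            image[path] = value            # value: e.g. the replication factor
        elif op == "set_replication":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
    # The merged image would be flushed to disk and the old EditLog truncated.
    return image, []
```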
The three common types of failures are NameNode failures, DataNode failures, and network partitions.

Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.

Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

Data Integrity

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.
When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, the client can opt to retrieve that block from another DataNode that has a replica of that block.

Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.

Snapshots

Snapshots support storing a copy of data at a particular instant of time. One use of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release.

Data Organization

Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB.
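The block layout and the client-side checksum scheme described in this section can be sketched together: split a file into fixed-size blocks, record a checksum per block at write time, and verify each block on read. This is only an illustration; real HDFS checksums with CRC-style checks over small chunks, while SHA-256 is used here purely for simplicity.

```python
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # the typical 64 MB HDFS block size

def split_and_checksum(data, block_size=BLOCK_SIZE):
    """Chop a byte string into fixed-size blocks and record one checksum
    per block, mimicking what the client does at file-creation time."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return blocks, [hashlib.sha256(b).hexdigest() for b in blocks]

def verify_block(block, expected):
    """On read, recompute the checksum; a mismatch means the client should
    fetch this block from another DataNode holding a replica."""
    return hashlib.sha256(block).hexdigest() == expected
```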
An HDFS file is thus chopped up into 64 MB chunks, and if possible, each chunk resides on a different DataNode.

Staging

A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.

The above approach has been adopted after careful consideration of the target applications that run on HDFS. These applications need streaming writes to files. If a client wrote to a remote file directly without any client-side buffering, network speed and network congestion would hurt throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client-side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.

Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode.
This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository, and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline while forwarding data to the next one in the pipeline. The data is pipelined from one DataNode to the next.

Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.

FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. FS shell is targeted at applications that need a scripting language to interact with the stored data.

DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator.

Browser Interface

A typical HDFS installation configures a web server to expose the HDFS namespace through a configurable TCP port.
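The replication pipeline described above can be modeled as a chain of nodes, each writing a small portion locally while forwarding it downstream. A minimal sketch; the class and method names are invented, and a real pipeline is asynchronous with acknowledgements rather than a chain of synchronous calls.

```python
PORTION = 4 * 1024  # data moves through the pipeline in small (4 KB) portions

class DataNode:
    """Toy DataNode: writes each portion locally, then forwards it onward."""
    def __init__(self, name, next_node=None):
        self.name, self.next_node, self.storage = name, next_node, b""

    def receive(self, portion):
        self.storage += portion          # write to the local repository...
        if self.next_node is not None:   # ...while forwarding downstream
            self.next_node.receive(portion)

def write_block(block, pipeline_head, portion=PORTION):
    """Stream a block into the pipeline portion by portion, as the client does."""
    for i in range(0, len(block), portion):
        pipeline_head.receive(block[i:i + portion])
```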
This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

Space Reclamation

File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there can be an appreciable delay between the time a file is deleted by a user and the time the corresponding free space appears in HDFS.

A user can undelete a file after deleting it as long as it remains in the /trash directory. A user who wants to undelete a file can navigate the /trash directory and retrieve it. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory, with one special feature: HDFS applies specified policies to automatically delete files from it. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well-defined interface.

Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks, and the corresponding free space appears in the cluster. Once again, there might be a delay between the completion of the setReplication API call and the appearance of free space in the cluster.

Translated text: Hadoop Distributed File System: Architecture and Design. I. Introduction. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

How to Find Foreign-Language Literature

Searching for foreign-language literature is an important part of academic research. The following introduces methods for retrieving and finding it.

I. Know the retrieval tools

1. Academic search engines, such as Google Scholar, PubMed, and Microsoft Academic.

These engines cover academic publications worldwide, including journal articles, theses and dissertations, and conference papers.

2. Literature databases, such as Web of Science, Scopus, IEEE Xplore, and ScienceDirect.

These databases hold a large body of academic publications and support more precise, specialized searches.

3. Library indexes and catalogs, such as a university library's online catalog (OPAC).

Libraries have rich holdings, usually including electronic resources, and can be searched through the library website.

II. Choose suitable search terms and a search strategy

1. Choosing search terms: select keywords appropriate to the research topic.

Keywords should be relevant to the field under study and may include technical terms, subject headings, personal names, and place names.

2. Combining search terms: different terms can be combined using Boolean operators (such as AND, OR, NOT) to build logical relationships that narrow or broaden the search and produce more precise results.
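The Boolean operators AND, OR, and NOT map directly onto set operations over an inverted index, a map from each term to the set of documents containing it. A minimal sketch with invented index contents:

```python
# Toy inverted index: term -> set of document ids containing it (made up).
index = {
    "crawler": {1, 2, 4},
    "search": {1, 3, 4},
    "engine": {1, 3},
}

def AND(a, b): return a & b                 # both terms must appear
def OR(a, b):  return a | b                 # either term may appear
def NOT(universe, a): return universe - a   # exclude documents with the term

# Evaluate the query "search AND (NOT crawler)" over four documents.
universe = {1, 2, 3, 4}
result = AND(index["search"], NOT(universe, index["crawler"]))
```

Here `result` narrows to the single document that mentions "search" but not "crawler", which is exactly how combining operators shrinks or grows a result set.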

III. Search and screen the literature

1. First choose a suitable database or search engine and enter the relevant search terms.

3. Read and screen the full text; if a paper fits the research needs, go on to collect the references it cites.

1. Citation indexing: within the high-quality papers already found, look up the other works they cite.

Citation indexes lead to related follow-up studies and to classic papers.

V. Use a reference management tool.

English Essays about Search Engines

An English essay about search engines opens a chapter on the modern era.

Below are some English essays about search engines for your reference!

Essay 1

In terms of the most direct measure of user experience, search speed, Google has always been half a beat slower than the Chinese search engine Baidu. Even after Google announced its entry into China, its home page still failed to load from time to time, which is the main reason many formerly loyal Google users switched to Baidu. And in personalized services such as MP3 search and the Tieba forums, Baidu understands the needs of Chinese users far more deeply than Google does.

From a technology standpoint, Google has always prided itself on its search technology, and in English search that pride is justified. But Chinese, one of the most complex languages in the world, differs greatly from English where search technology is concerned, and Google's performance in Chinese search is still not satisfactory.

Essay 2

Google is an American public corporation, earning revenue from advertising related to its Internet search, e-mail, online mapping, office productivity, social networking, and video sharing services, as well as selling advertising-free versions of the same technologies. The Google headquarters, the Googleplex, is located in Mountain View, California. As of December 31, 2008, the company had 20,222 full-time employees. Google was co-founded by Larry Page and Sergey Brin while they were students at Stanford University, and the company was first incorporated as a privately held company on September 4, 1998. The initial public offering took place on August 19, 2004, raising US$1.67 billion and making the company worth US$23 billion. Google has continued its growth through a series of new product developments, acquisitions, and partnerships. Environmentalism, philanthropy, and positive employee relations have been important tenets during the growth of Google, the last of which led to its being identified multiple times as Fortune Magazine's #1 Best Place to Work. The unofficial company slogan is "Don't be evil", although criticism of Google includes concerns regarding the privacy of personal information, copyright, censorship, and discontinuation of services. According to Millward Brown, it is the most powerful brand in the world.

Essay 3

If you want to search for information on the Internet, you can use a search engine.

Seven Free Chinese and English Literature Search Sites Worth Bookmarking

Writing an academic paper is inseparable from finding and using the literature.

Besides the domestic databases such as CNKI, Wanfang, and VIP, and services like Baidu Wenku, are there other good literature search engines, especially for foreign-language literature? The answer is yes.

Today the editors at 易起论文 recommend seven academic literature search tools.

1. CiteSeerX

CiteSeerX is the successor to CiteSeer. Like CiteSeer, it provides a completely free public service, updated in real time around the clock. The CiteSeer citation search engine was developed by the NEC Research Institute at Princeton, and it was the first digital library of academic papers built on an Autonomous Citation Indexing (ACI) system. CiteSeerX uses automatic recognition technology to collect academic papers published on the web in PostScript and PDF formats, then indexes and links each article using citation-indexing methods. Its goal is to organize the online literature effectively and to promote the dissemination of, and feedback on, academic work from multiple angles.

The CiteSeerX search interface is simple and clear. The default search is over documents; author and table searches are also supported. Selecting "Include Citations" widens the scope to cover not only the full-text database but also the references listed in each paper. Clicking "Advanced Search" opens the advanced search interface, which supports AND-combinations of the following fields: title, author, author affiliation, journal or proceedings name, publication year, abstract, keywords, full text, and user-defined tags.

You can also build a combined query yourself in the single search box on the home page, for example author:(jkleinberg) AND venue:(journal of the acm).

Advanced search increases precision: besides basic fields such as author, affiliation, and title, it supports more detailed searches over the full text and over user-defined tags.

Google Scholar Tips for Finding Foreign-Language Literature

Google Scholar is a powerful tool for retrieving foreign-language literature. Here are some tips for getting the best results from it:
1. Search by keyword: use keywords relevant to the topic you are interested in. Keywords should describe the topic accurately; choosing precise vocabulary yields more exact results.

2. Use double quotes for exact phrases: to search for a specific phrase, put it in double quotes. The search engine will then match your query more exactly.

3. Use exclusion terms to limit results: to exclude certain words, use a minus sign (-) in the query. For example, to search for "climate change" but not "extreme climate change", enter: climate change -extreme.

4. Restrict the search to a site: to search only within a particular website, add "site:" followed by the site's domain name. For example, a query such as "wine site:" followed by Wikipedia's domain restricts the search to wine-related content on Wikipedia.

5. Check related articles and citations: Google Scholar shows other articles related to, or citing, a selected paper. Browsing them can surface more useful information and references.

6. Add filters to narrow the results: Google Scholar's filters give you finer control; you can filter by publication year, author, journal, and so on.

7. Learn the advanced search options: clicking the vertical-dots icon beside the Google Scholar search box opens the advanced search options, which help you search and filter more precisely.

In short, by making use of Google Scholar's search features, you can obtain the foreign-language literature you need far more conveniently.

Foreign-Language Literature on Web Crawlers

As an important means of data collection, crawler technology plays a major role in web information mining and data analysis. This article recommends some foreign-language works on crawlers for study and research.

1. "Web Scraping with Python: Collecting Data from the Modern Web" by Ryan Mitchell. This book explains in detail how to scrape web pages with Python, from basic concepts to practical cases, covering many common crawling techniques and tools. From it you can learn the fundamentals of crawling, anti-scraping countermeasures, and how to collect data efficiently.

2. "Scraping the Web: Strategies and Techniques for Data Mining" by Dmitry Zinoviev. This book discusses a range of crawling strategies and techniques, including distributed and incremental crawlers. It also covers data mining and text analysis, making it a comprehensive guide to crawler technology.

3. "Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Instagram, Pinterest, and More" by Matthew A. Russell. This book focuses on collecting data from social media platforms such as Facebook and Twitter, using rich case studies to show how crawlers can mine valuable information from social media.

4. "Crawling the Web: An Introduction to Web Scraping and Data Mining" by Michael H. Goldwasser and David Letscher. This book is a beginner's guide to crawling and data mining, covering basic crawler concepts, the HTTP protocol, regular expressions, data storage, and data analysis.

Foreign-Language Literature Retrieval

II. 1. Go to the Zhengzhou University of Light Industry home page, then open the Library.

2. In the library, choose "Foreign Resources", then the "Elsevier SDOL e-journal database".

3. On that page, searching "data structure" in all fields returns 3,356,357 articles, including:

A. "Data structures and DBMS for computer-aided design systems", C.J. Anumba
B. "Dynamic adaptive data structures for monitoring data streams", P. Trancoso, V. Muntes-Mulero, rriba-Pey
C. "Hybrid structure: data structuring for data flow", B. Lee, A.R. Hurson

2. (1) Entering "basic computer organization" finds "The Basic Organization of Digital Computers", Hans W. Gschwind, 1967. Abstract: So far, we have concerned ourselves with the details of logic design; that is, we have developed the skills and the techniques to implement individual computer subunits. But no matter how well we understand the details of these circuits, the overall picture of a computer will be rather vague as long as we do not perceive the systematic organization of its components. If we want not only to recognize the structure of a specific machine, but also attempt to judge the significance of variations in the layout of different machines, it is further necessary to comprehend the philosophy of their design.

3. Entering "computer science and technology" finds:

A. [Author] School of Computer Science and Technology (ed.); [Publisher] Science Press; [Date] December 2004, 1st edition; [ISBN] 7030145518
B. [Author] ROBERT ROSENTHAL; [Publisher] U.S. DEPARTMENT OF COMMERCE; [Date] 1982
C. [Author] JULES ARONSON; [Publisher] U.S. DEPARTMENT OF COMMERCE; [Date] 1977

Tips for Translating Foreign-Language Literature

Translate 5,000 words in five minutes: a toolbox for translating foreign literature, worth saving. Reading and translating foreign literature is a very important part of research, and much of the best work in many fields is published in foreign languages, so it pays to learn from others' translation experience.

For particular reasons I translate foreign literature fairly often, and over time I have discovered three great tools for the job: Google Translate, Kingsoft PowerWord (the full version), and the CNKI Translation Assistant.

The workflow is as follows: 1. Turn on PowerWord's automatic word lookup, then read the document. 2. When you hit a long sentence you cannot parse, hand it to Google Translate; the raw output looks awful at first glance, but once your brain reprocesses it the meaning usually becomes clear. 3. If Google still does not help and something feels off, you have probably misread some "common word" that carries a special meaning in the literature; look it up in the CNKI Translation Assistant, whose word senses are drawn from a large body of literature and therefore match very well.

Also, it is best to translate in units of paragraphs or long sentences, so that you do not miss the forest for the trees.

Notes:

1. Google Translate: as everyone knows, the English literature and materials in Google are fairly comprehensive. I use it in two ways. On one hand, it can be used to find English papers; there are many posts about that, so I will not repeat them here. Back to translation. Here is an example: suppose you do not know how to translate the term "电磁感应透明效应" (electromagnetically induced transparency). First search the Chinese term in CNKI and use the Chinese-English keyword pairs given there, which are usually quite accurate. As for finding the translation through Google itself: everyone has a dictionary, and the usual approach is to look the words up one by one and type them into Google. That word-for-word translation is generally not quite right, so verify it: search your rough translation in Google, look through the related literature and materials that come up, and you will soon spot the most accurate, truly idiomatic rendering. That is how I use it.

2. CNKI Translation Assistant: this site needs little introduction; some readers already know it.


(The document contains the English original and a Chinese translation, side by side.) Translated text: Exploring Search Engine Crawlers

With the Web's unimaginably rapid expansion, extracting knowledge from the Web has become a popular pursuit. This is due to the Web's convenience and its wealth of information. We usually need crawler-based search engines to find the pages we need. This paper describes the basic tasks a search engine performs and outlines how search engines and web crawlers are connected.

Keywords: crawling, focused crawling, web crawler

1. Introduction

The WWW is a service that resides on computers connected to the Internet and allows end users to access the data stored on those computers using standard interface software.

The World Wide Web is the universe of network-accessible information, an embodiment of human knowledge. A search engine is a computer program that searches the web and scans for particular keywords, especially as a commercial service, returning lists of the material it finds; it populates its database mainly by receiving lists from authors who want their work published, or through "web crawlers", "spiders", or "robots" that roam the Internet capturing the links and information of the pages they visit. A web crawler is a program that automatically retrieves information from the World Wide Web. Web page retrieval is an important research topic. A crawler is a software component that visits the web's tree structure according to some strategy, searching for objects and collecting them into a local repository.

The rest of this paper is organized as follows: in Section 2 we explain the background of web crawlers in detail. In Section 3 we discuss the types of crawlers, and in Section 4 we describe how web crawlers work. In Section 5 we present two advanced web-crawling techniques. In Section 6 we discuss some of the more interesting open problems.

2. A survey of web crawlers

Web crawlers are almost as old as the Web itself. The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic. Many papers about web crawling were presented at the first two World Wide Web conferences. At the time, however, the Web was three to four orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in crawling today's Web. Obviously, the crawlers used by all the popular search engines must scale to a substantial portion of the Web. But because search engines are a competitive business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. Unfortunately, the descriptions of these crawlers in the literature are too terse to be reproduced.

最初的谷歌爬虫(在斯坦福大学开发)由运行在不同进程中的五个功能组件构成。

一个URL服务器进程从文件中读取URL,并把它们转发给多个爬虫进程。

每个爬虫进程运行在不同的机器上,采用单线程,并使用异步I/O并行地从多达300个Web服务器抓取数据。

爬虫将下载的页面传送给单一的存储服务器进程,由它压缩页面并存入磁盘。

随后,这些页面由一个索引进程从磁盘读回并解析,从HTML页面中提取链接,保存到另一个磁盘文件中。

一个URL解析器进程读取该链接文件,将其中的相对网址补全为绝对URL,并保存到供URL服务器读取的磁盘文件中。

通常使用三到四台爬虫机器,因此整个系统需要四到八台机器。
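上文描述的"多个爬虫进程并行抓取"的思路,可以用线程池粗略示意如下(纯示意代码:用线程池代替原文的多机异步I/O,fetch 的具体实现与参数均为假设,也未实现谷歌爬虫的压缩存储与索引组件):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """下载单个页面并返回字节内容;实际系统还需处理重试与限速。"""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()

def crawl_parallel(urls, fetch_fn=fetch, max_workers=8):
    """用线程池并行抓取一批URL,返回 {url: 页面内容,失败则为None}。"""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_fn, u): u for u in urls}
        for fut, url in futures.items():
            try:
                results[url] = fut.result()
            except Exception:
                results[url] = None  # 单个站点失败不应中断整体抓取
    return results
```

把 fetch_fn 作为参数传入,便于在无网络环境下用桩函数演示和测试这一并行结构。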

在谷歌转变为商业公司之后,斯坦福大学仍在继续这方面的研究。

斯坦福WebBase项目组已实现一个高性能的分布式爬虫,具有每秒下载50到100个文件的能力。

Cho等人还建立了文档更新频率的模型,用以指导增量式爬虫的下载调度。

互联网档案馆也使用多台计算机来抓取网页。

每个爬虫进程最多被分配64个站点进行抓取,并且没有任何站点会被分配给一个以上的爬虫。

每个单线程爬虫进程从磁盘读取分配给它的站点的种子URL列表,放入各站点对应的队列,然后使用异步I/O从这些队列并行地抓取网页。

一旦一个页面下载完毕,爬虫就提取包含在其中的链接。

如果一个链接指向的站点正是包含该页面的站点,就把它加入相应站点的队列;否则记录到磁盘上。

每隔一段时间,一个批处理程序会把这些记录下来的"跨站"URL合并到各站点的种子集合中,并在此过程中过滤掉重复项。
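上面这种"每个站点只归一个爬虫"的分配方式,常见的一种做法是按主机名哈希取模。下面是一个示意(假设用SHA-1哈希做分配,原文并未说明互联网档案馆的具体分配算法):

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse

def assign_crawler(url, num_crawlers):
    """按主机名哈希把站点固定分配给某个爬虫进程,
    从而保证同一站点不会被分配给多个爬虫。"""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode()).hexdigest()
    return int(digest, 16) % num_crawlers

def build_site_queues(urls, num_crawlers):
    """把种子URL按归属爬虫分组,形成每个爬虫各自的待抓队列。"""
    queues = defaultdict(list)
    for u in urls:
        queues[assign_crawler(u, num_crawlers)].append(u)
    return dict(queues)
```

同一主机名的所有URL总是落到同一个队列,这正是正文所说的"没有网站被分配到一个以上的爬虫"。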

WebFountain爬虫与Mercator爬虫有若干共同特点:它是分布式的、连续的(作者称之为"增量式的")、有礼貌的、可配置的。

不幸的是,在撰写本文时,WebFountain还处于开发的早期阶段,尚未公布其性能数据。

3.搜索引擎的基本类型

A.基于爬虫的搜索引擎

基于爬虫的搜索引擎自动创建自己的清单。

它们的清单由计算机程序"蜘蛛"建立,而不经人工挑选。

它们不是按学科分类组织的,而是由计算机算法对所有网页进行排序。

这类搜索引擎往往规模巨大,常常能获取大量的信息;对复杂的搜索,它允许在先前搜索的结果范围内继续搜索,使你能够逐步细化搜索结果。

这种类型的搜索引擎包含了它所链接到的网页的全文。

所以人们可以通过匹配网页中的单词找到他们想要的网页。

B.人工编制的页面目录

这类目录通过人工挑选建立,即依赖人来创建列表。

它们按主题类别组织,由主题对网页进行分类。

人工驱动的目录永远不会包含其所链接网页的全文。

它们的规模比大多数搜索引擎小。

C.混合搜索引擎

混合搜索引擎不同于传统的面向文本的搜索引擎(如谷歌)或基于目录的搜索引擎(如雅虎)。后两类程序都是通过把一组元数据与用户的搜索查询进行比较来工作的,其主要语料是由网络爬虫获取、或对全部互联网文本做分类分析而得到的元数据。

与此相反,混合搜索引擎除上述元数据外,还可以使用一个或多个其他元数据集,例如来自客户端网络的情境元数据,用以建模客户端的上下文感知。

4.爬虫的工作原理

网络爬虫是搜索引擎必不可少的组成部分;运行一个网络爬虫是一项极具挑战性的任务。

其中既有棘手的性能和可靠性问题,更重要的还有社会性问题。

爬取是最脆弱的应用,因为它要与成百上千的Web服务器和各种域名服务器交互,而这些都超出了系统自身的控制范围。

网页抓取速度不仅取决于自身互联网连接的速度,也取决于被抓取站点的速度。

特别是当从多个服务器抓取站点时,如果许多下载并行进行,总抓取时间可以大大缩短。

虽然网络爬虫的应用众多,但它们的核心内容基本上是相同的。

以下是网络爬虫的工作过程:1)下载网页;2)解析已下载的页面并提取其中所有链接;3)对每一个提取到的链接,重复这个过程。
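上述第2步"解析页面并提取链接",用Python标准库的HTML解析器可以写出一个最小示意(只收集 a 标签的 href 属性,不做URL规范化等实际系统必需的处理):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """解析HTML,收集所有 <a href=...> 链接。"""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """返回页面中按出现顺序提取的全部链接。"""
    parser = LinkParser()
    parser.feed(html)
    return parser.links
```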

网络爬虫可用于对互联网或局域网上的整个网站进行抓取。

你可以指定一个起始URL,爬虫会跟随在该HTML页面中找到的所有链接。

这通常会带来更多的链接,这些链接又会被继续跟随,如此往复。

一个网站可以被看作一个树状结构:树根是起始URL,该根HTML页面中的所有链接是根的直接子节点。

后续的链接则是这些子节点的子节点。

一个URL服务器向多个爬虫提供URL清单。

网络爬虫从解析一个指定的网页开始,记下该网页上指向其他网页的超文本链接。

然后它们再解析那些网页,寻找新的链接,如此递归下去。

网络爬虫软件并不会像病毒或智能代理那样,实际迁移到互联网上各处不同的计算机上运行。

每个爬虫每次大约同时保持300个连接打开。

要以足够快的速度检索网页,这是必需的。

一个爬虫驻留在单台机器上。

爬虫只是向互联网上的其他机器发送获取文档的HTTP请求,就像网页浏览器在用户点击链接时所做的那样。

爬虫实际所做的,就是把跟随链接的过程自动化。

网页抓取可以被看作对一个队列中的条目进行处理。

当爬虫访问一个网页时,它提取指向其他网页的链接。

因此,爬虫把这些网址放到队列的末尾,并继续抓取它从队列前端取出的下一个页面。
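正文描述的这种队列处理方式可以写成如下示意(广度优先;get_links 是假设的取链接函数,作为参数传入以便脱离网络环境演示;另加了去重集合,这是实际爬虫的常规做法):

```python
from collections import deque

def crawl(start_url, get_links, max_pages=100):
    """队列式抓取:新链接入队尾,待抓URL从队首取出。"""
    frontier = deque([start_url])   # 待抓取队列
    visited = set()                 # 去重,避免重复抓取
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()    # 从队列前端取出
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in get_links(url): # 新发现的链接放到队列末尾
            if link not in visited:
                frontier.append(link)
    return order
```

用一张小"链接图"即可验证访问顺序正是逐层展开的树状结构。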

A.资源约束

爬取会消耗资源:下载页面的带宽、维护支撑算法的私有数据结构所需的内存、评估和选择URL的CPU,以及存储所抓取页面的文本和链接及其他持久性数据的磁盘。

B.机器人协议

robots.txt文件给出指令,将网站的一部分排除在抓取范围之外。

类似地,一个简单的文本文件也可以提供已发布对象的新鲜度和热门程度信息。

这些信息使爬虫能够优化其刷新已收集数据的策略以及对象替换策略。
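Python标准库自带robots.txt解析器,可以示意爬虫如何遵守机器人协议(这里直接解析一段假设的robots.txt文本;线上使用时应以 rp.set_url(...) 指向站点的robots.txt再调用 rp.read()):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# 假设的robots.txt内容:禁止抓取 /private/ 下的页面
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def allowed(url, agent="*"):
    """抓取每个URL前先询问机器人协议。"""
    return rp.can_fetch(agent, url)
```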

C.元搜索引擎

元搜索引擎是一种没有自己的网页数据库的搜索引擎。

它把搜索词发送给其他搜索引擎维护的数据库,并把从所有被查询的搜索引擎得到的结果提供给用户。

只有较少的元搜索引擎能让你深入到最大、最有用的那些搜索引擎数据库。

它们往往返回来自较小的和/或免费的搜索引擎及各种免费目录的结果,这些来源通常规模小且高度商业化。
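元搜索引擎"合并多个引擎结果"这一步可以用一个小函数示意(纯示意:采用轮转交错的合并与去重策略,这只是一种假设的排序方式,实际元搜索引擎各有自己的融合算法):

```python
def merge_results(result_lists):
    """交错合并多个引擎返回的结果列表并去重,
    轮转取数让每个引擎的靠前结果都有机会排在前面。"""
    seen, merged = set(), []
    iters = [iter(r) for r in result_lists]
    while iters:
        still_alive = []
        for it in iters:
            try:
                url = next(it)
            except StopIteration:
                continue  # 该引擎的结果已取完
            still_alive.append(it)
            if url not in seen:
                seen.add(url)
                merged.append(url)
        iters = still_alive
    return merged
```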

5.爬取技术

A.主题爬取

通用网络爬虫会从一组特定的URL出发,尽可能多地收集网页。

而主题爬虫的设计目标是只收集特定主题的文档,从而减少网络流量和下载量。

主题爬虫的目标是有选择地寻找与预先定义的主题集合相关的网页。

指定主题时不使用关键字,而是使用示范文档。

主题爬虫不是收集并索引所有可访问的Web文档以回答所有可能的临时查询,而是分析其抓取边界,找出最可能与本次抓取相关的链接,并避开Web中不相关的区域。

这在硬件和网络资源上带来极大的节省,并有助于保持抓取数据处于最新状态。

主题爬虫有三个主要组成部分:一个分类器,对抓取到的网页做相关性判断,以决定是否扩展其中的链接;一个蒸馏器,衡量已抓取页面的中心度,以确定访问的优先次序;以及一个带有动态可重配置优先级控制的爬虫,受分类器和蒸馏器的共同控制。

对主题爬取最关键的评价指标是收获率,即抓取过程中相关页面被获取、不相关页面被有效过滤掉的比率。收获率必须足够高,否则主题爬虫会把大量时间仅仅花在剔除不相关的网页上,那样还不如使用普通爬虫。
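收获率本身就是一个简单的比值,可直接写成代码(is_relevant 对应正文中的分类器,这里用一个假设的判断函数代替):

```python
def harvest_ratio(pages, is_relevant):
    """收获率 = 相关页面数 / 已抓取页面总数。"""
    if not pages:
        return 0.0
    relevant = sum(1 for page in pages if is_relevant(page))
    return relevant / len(pages)
```

收获率越接近1,说明主题爬虫把抓取预算花在相关页面上的比例越高。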

B.分布式爬取

由于网络的成长性和动态性,对其进行索引是一个挑战。

随着网络规模越来越大,必须将抓取过程并行化,才能在合理的时间内完成网页下载。

对于需要快速获取大量数据的大型引擎来说,单个抓取进程即使使用多线程也是不够的。

当使用单个集中式爬虫时,所有抓取到的数据都要通过同一条物理链路。而通过多个进程分担抓取活动,可以帮助构建一个可扩展、易于配置且具有容错性的系统。

拆分负载降低了硬件要求,同时提高了整体下载速度和可靠性。

每个任务都以完全分布式的方式执行,也就是说,不存在中央协调器。
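"无中央协调器"的一种常见实现思路是:URL的归属由确定性哈希决定,任何进程都能独立算出,无需询问他人(以下为示意代码,进程数与哈希方案均为假设):

```python
import hashlib
from urllib.parse import urlparse

NUM_PROCS = 4  # 假设共4个爬虫进程

def owner(url):
    """任何进程都能独立算出URL的归属进程,因此无需中央协调器。"""
    host = urlparse(url).netloc
    return int(hashlib.sha1(host.encode()).hexdigest(), 16) % NUM_PROCS

def split_discovered(links, my_id):
    """把新发现的链接分为本进程自己抓取与转发给其他进程两类。"""
    mine, forward = [], []
    for u in links:
        (mine if owner(u) == my_id else forward).append(u)
    return mine, forward
```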

6.挑战:选择更"有趣"页面的问题

搜索引擎因为收集用户的查询,所以能够了解热门话题。

抓取过程根据某种重要性度量对URL进行优先排序,例如(与引导查询的)相似度、反向链接数、PageRank,或它们的组合/变体。

最近,Najork等人表明,广度优先搜索能够首先收集到高质量页面,并提出了PageRank的一种变体。

然而,目前的搜索策略无法准确选择"最佳"路径,因为它们的认识仅仅是局部的。
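"按重要性度量排定抓取顺序"可以用一个优先级队列示意(这里用假设的反向链接数作为分数;heapq 是最小堆,故存入负分数,让高分URL先出队):

```python
import heapq

def crawl_order(url_scores, max_pages=10):
    """按重要性分数从高到低给出抓取顺序。"""
    heap = [(-score, url) for url, score in url_scores.items()]
    heapq.heapify(heap)
    order = []
    while heap and len(order) < max_pages:
        neg_score, url = heapq.heappop(heap)
        order.append(url)
    return order
```

把反向链接数换成相似度或PageRank分数,结构完全相同。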

由于互联网上可获得的信息数量非常庞大,目前不可能实现全面抓取。

因此,必须采用剪裁策略。

主题爬取和智能爬取,是发现与特定主题或主题集合相关的网页的技术。

结论

在本文中我们得出结论:受限于整个万维网的巨大规模和资源的可用性,完整的网络爬取覆盖是无法实现的。

通常通过设置某种阈值(访问URL的数量、网站树的层级深度、与主题的符合程度等),来限制对选定网站的抓取过程。
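上述阈值剪裁可以浓缩为一个判断函数示意(阈值数字为假设值,实际取决于资源预算):

```python
def should_crawl(depth, pages_visited, max_depth=3, max_pages=1000):
    """深度或已访问页数任一超限即停止抓取。"""
    return depth <= max_depth and pages_visited < max_pages
```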

搜索引擎可以利用这些信息来存储/刷新最相关和最新的网页,从而提高检索内容的质量,同时减少陈旧内容和缺页。

原文:Discussion on Web Crawlers of Search Engine

Abstract—With the precipitous expansion of the Web, extracting knowledge from the Web is becoming gradually important and popular. This is due to the Web's convenience and richness of information. To find Web pages, one typically uses search engines that are based on the Web crawling framework. This paper describes the basic tasks performed by a search engine and gives an overview of how Web crawlers are related to search engines.

Keywords: Distributed Crawling, Focused Crawling, Web Crawlers

Ⅰ. INTRODUCTION

The WWW is a service that resides on computers that are connected to the Internet and allows end users to access data that is stored on the computers using standard interface software. The World Wide Web is the universe of network-accessible information, an embodiment of human knowledge. A search engine is a computer program that searches for particular keywords and returns a list of documents in which they were found, especially a commercial service that scans documents on the Internet. A search engine finds information for its database by accepting listings sent in by authors who want exposure, or by getting the information from their "Web crawlers," "spiders," or "robots," programs that roam the Internet storing links to and information about each page they visit. A Web crawler is a program which fetches information from the World Wide Web in an automated manner. Web crawling is an important research issue. Crawlers are software components which visit portions of Web trees, according to certain strategies, and collect retrieved objects in local repositories. The rest of the paper is organized as follows: in Section 2 we explain the background details of Web crawlers. In Section 3 we discuss types of crawlers, and in Section 4 we explain the working of a Web crawler. In Section 5 we cover two advanced techniques of Web crawlers.
In Section 6 we discuss the problem of selecting more interesting pages.

Ⅱ. SURVEY OF WEB CRAWLERS

Web crawlers are almost as old as the Web itself. The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of Mosaic. Several papers about Web crawling were presented at the first two World Wide Web conferences. However, at the time, the Web was three to four orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today's Web. Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the Web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. Unfortunately, the descriptions of these crawlers in the literature are too terse to enable reproducibility. The original Google crawler (developed at Stanford) consisted of five functional components running in different processes. A URL server process read URLs out of a file and forwarded them to multiple crawler processes. Each crawler process ran on a different machine, was single-threaded, and used asynchronous I/O to fetch data from up to 300 Web servers in parallel. The crawlers transmitted downloaded pages to a single Store Server process, which compressed the pages and stored them to disk. The pages were then read back from disk by an indexer process, which extracted links from HTML pages and saved them to a different disk file. A URL resolver process read the link file, resolved the relative URLs contained therein, and saved the absolute URLs to the disk file that was read by the URL server. Typically, three to four crawler machines were used, so the entire system required between four and eight machines.
Research on Web crawling continues at Stanford even after Google has been transformed into a commercial effort. The Stanford WebBase project has implemented a high performance distributed crawler, capable of downloading 50 to 100 documents per second. Cho and others have also developed models of document update frequencies to inform the download schedule of incremental crawlers. The Internet Archive also used multiple machines to crawl the Web. Each crawler process was assigned up to 64 sites to crawl, and no site was assigned to more than one crawler. Each single-threaded crawler process read a list of seed URLs for its assigned sites from disk into per-site queues, and then used asynchronous I/O to fetch pages from these queues in parallel. Once a page was downloaded, the crawler extracted the links contained in it. If a link referred to the site of the page it was contained in, it was added to the appropriate site queue; otherwise it was logged to disk. Periodically, a batch process merged these logged "cross-site" URLs into the site-specific seed sets, filtering out duplicates in the process. The WebFountain crawler shares several of Mercator's characteristics: it is distributed, continuous (the authors use the term "incremental"), polite, and configurable. Unfortunately, as of this writing, WebFountain is in the early stages of its development, and data about its performance is not yet available.

Ⅲ. BASIC TYPES OF SEARCH ENGINE

A. Crawler Based Search Engines

Crawler based search engines create their listings automatically: computer programs ("spiders") build them, not human selection. They are not organized by subject categories; a computer algorithm ranks all pages. Such kinds of search engines are huge and often retrieve a lot of information; for complex searches they allow one to search within the results of a previous search and enable you to refine search results.
These types of search engines contain the full text of the Web pages they link to, so one can find pages by matching words in the pages one wants.

B. Human Powered Directories

These are built by human selection, i.e. they depend on humans to create listings. They are organized into subject categories, and subjects do the classification of pages. Human powered directories never contain the full text of the Web pages they link to. They are smaller than most search engines.

C. Hybrid Search Engine

A hybrid search engine differs from a traditional text oriented search engine such as Google, or a directory-based search engine such as Yahoo, in which each program operates by comparing a set of meta data with a user search query, the primary corpus being the meta data derived from a Web crawler or taxonomic analysis of all internet text. In contrast, a hybrid search engine may use these two bodies of meta data in addition to one or more sets of meta data that can, for example, include situational meta data derived from the client's network that would model the context awareness of the client.

Ⅳ. WORKING OF A WEB CRAWLER

Web crawlers are an essential component of search engines; running a Web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of Web servers and various name servers, which are all beyond the control of the system. Web crawling speed is governed not only by the speed of one's own Internet connection, but also by the speed of the sites that are to be crawled. Especially if one is crawling sites from multiple servers, the total crawling time can be significantly reduced if many downloads are done in parallel. Despite the numerous applications for Web crawlers, at the core they are all fundamentally the same.
Following is the process by which Web crawlers work: 1. Download the Web page. 2. Parse through the downloaded page and retrieve all the links. 3. For each link retrieved, repeat the process. The Web crawler can be used for crawling through a whole site on the Inter-/Intranet. You specify a start-URL and the crawler follows all links found in that HTML page. This usually leads to more links, which will be followed again, and so on. A site can be seen as a tree-structure: the root is the start-URL; all links in that root-HTML-page are direct sons of the root. Subsequent links are then sons of the previous sons. A single URL server serves lists of URLs to a number of crawlers. A Web crawler starts by parsing a specified Web page, noting any hypertext links on that page that point to other Web pages. They then parse those pages for new links, and so on, recursively. Web crawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve Web pages at a fast enough pace. A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a Web browser does when the user clicks on links. All the crawler really does is to automate the process of following links. Web crawling can be regarded as processing items in a queue.
When the crawler visits a Web page, it extracts links to other Web pages. So the crawler puts these URLs at the end of a queue, and continues crawling to a URL that it removes from the front of the queue.

A. Resource Constraints

Crawlers consume resources: network bandwidth to download pages, memory to maintain private data structures in support of their algorithms, CPU to evaluate and select URLs, and disk storage to store the text and links of fetched pages as well as other persistent data.

B. Robot Protocol

The robots.txt file gives directives for excluding a portion of a Web site from being crawled. Analogously, a simple text file can furnish information about the freshness and popularity of published objects. This information permits a crawler to optimize its strategy for refreshing collected data as well as its object replacement policy.

C. Meta Search Engine

A meta-search engine is the kind of search engine that does not have its own database of Web pages. It sends search terms to the databases maintained by other search engines and gives users the results that come from all the search engines queried. Fewer meta searchers allow you to delve into the largest, most useful search engine databases.
They tend to return results from smaller and/or free search engines and miscellaneous free directories, often small and highly commercial.

Ⅴ. CRAWLING TECHNIQUES

A. Focused Crawling

A general purpose Web crawler gathers as many pages as it can from a particular set of URLs, whereas a focused crawler is designed to only gather documents on a specific topic, thus reducing the amount of network traffic and downloads. The goal of the focused crawler is to selectively seek out pages that are relevant to a predefined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. The focused crawler has three main components: a classifier, which makes relevance judgments on pages crawled to decide on link expansion; a distiller, which determines a measure of centrality of crawled pages to determine visit priorities; and a crawler with dynamically reconfigurable priority controls which is governed by the classifier and distiller. The most crucial evaluation of focused crawling is to measure the harvest ratio, which is the rate at which relevant pages are acquired and irrelevant pages are effectively filtered off from the crawl. This harvest ratio must be high, otherwise the focused crawler would spend a lot of time merely eliminating irrelevant pages, and it may be better to use an ordinary crawler instead.

B. Distributed Crawling

Indexing the Web is a challenge due to its growing and dynamic nature. As the size of the Web is growing, it has become imperative to parallelize the crawling process in order to finish downloading the pages in a reasonable amount of time. A single crawling process, even if multithreading is used, will be insufficient for large-scale engines that need to fetch large amounts of data rapidly. When a single centralized crawler is used, all the fetched data passes through a single physical link. Distributing the crawling activity via multiple processes can help build a scalable, easily configurable system, which is fault tolerant. Splitting the load decreases hardware requirements and at the same time increases the overall download speed and reliability. Each task is performed in a fully distributed fashion, that is, no central coordinator exists.

Ⅵ. PROBLEM OF SELECTING MORE "INTERESTING" PAGES

A search engine is aware of hot topics because it collects user queries. The crawling process prioritizes URLs according to an importance metric such as similarity (to a driving query), back-link count, PageRank or their combinations/variations. Recently Najork et al. showed that breadth-first search collects high-quality pages first and suggested a variant of PageRank. However, at the moment, search strategies are unable to exactly select the "best" paths because their knowledge is only partial. Due to the enormous amount of information available on the Internet, a total crawling is at the moment impossible; thus, pruning strategies must be applied. Focused crawling and intelligent crawling are techniques for discovering Web pages relevant to a specific topic or set of topics.

CONCLUSION

In this paper we conclude that complete Web crawling coverage cannot be achieved, due to the vast size of the whole WWW and to resource availability. Usually a kind of threshold is set up (number of visited URLs, level in the website tree, compliance with a topic, etc.) to limit the crawling process over a selected website. This information is available in search engines to store/refresh the most relevant and updated Web pages, thus improving the quality of retrieved contents while reducing stale content and missing pages.
