Exploring Search Engine Crawlers: Graduation Thesis Foreign-Literature Translations (Editable)
Design and Implementation of a Search Engine Web Crawler (Graduation Project)

Network resources are abundant, but how to search for information effectively is a difficult problem.
Building a search engine is the best way to solve this problem.
This paper first describes the architecture of an Internet-based search engine in detail, and then explains how to design and implement the search engine's gatherer, the web crawler.
The multi-threaded web crawler is a background program that parses and searches specified Web pages according to a breadth-first algorithm, fetches and saves every URL it finds, and then uses each saved URL as a new entry point to keep crawling the Internet automatically.
The crawler mainly applies socket programming, regular expressions, the HTTP protocol, Windows network programming, and other related techniques; it is written in C++ and was debugged successfully under VC6.0.
The chapter on the crawler's design and implementation not only explains the core techniques in detail but also illustrates them with the implementation code of the multi-threaded crawler, so it is easy to follow.
This crawler is a network program that runs in the background, takes its initial URLs from a configuration file, crawls downward with a breadth-first algorithm, and saves the target URLs; it can carry out ordinary users' web search tasks.
Keywords: search engine; web crawler; URL searcher; multithreading
Design and Realization of a Search Engine Web Spider
Abstract: Network resources are very rich, but searching out the effective information is a difficult task. Building a search engine is the best way to solve this problem. This paper first introduces the architecture of an Internet-based search engine and then illustrates how to implement the search engine's gatherer, the web spider. The multi-threaded spider parses and searches the assigned Web pages according to a breadth-first algorithm, fetches and saves every URL it finds, and uses each saved URL as a new entry point to keep crawling the Internet as an automatically running background program. The spider mainly applies socket technology, regular expressions, the HTTP protocol, Windows network programming, and other related techniques; it is implemented in C++ and was debugged under VC6.0. The chapter on the spider's design and implementation gives a detailed exposition of the core techniques together with the implementation code of the multi-threaded spider, which makes it easy to understand. This network spide…
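The breadth-first crawl described in this abstract can be sketched roughly as follows. This is a minimal, single-threaded illustration rather than the thesis's actual multi-threaded, socket-based implementation; fetch_page is a hypothetical placeholder for the HTTP download step, and the regular expression only approximates real link extraction.

#include <iostream>
#include <queue>
#include <regex>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical placeholder for the socket/HTTP layer described in the thesis:
// a real implementation would connect to the host, send a GET request, and
// return the response body.
std::string fetch_page(const std::string& url) {
    std::cout << "fetching " << url << "\n";
    return "<html><a href=\"http://example.com/a\">a</a></html>";
}

// Extract absolute http(s) links with a deliberately simple regular expression.
std::vector<std::string> extract_links(const std::string& html) {
    static const std::regex href_re("href=\"(https?://[^\"]+)\"");
    std::vector<std::string> links;
    for (auto it = std::sregex_iterator(html.begin(), html.end(), href_re);
         it != std::sregex_iterator(); ++it) {
        links.push_back((*it)[1].str());
    }
    return links;
}

// Breadth-first crawl: seed URLs go into a FIFO queue; every newly discovered
// URL is recorded once and becomes a new entry point, up to a page limit.
void crawl(const std::vector<std::string>& seeds, std::size_t max_pages) {
    std::queue<std::string> frontier;
    std::unordered_set<std::string> seen;
    for (const auto& s : seeds) {
        if (seen.insert(s).second) frontier.push(s);
    }
    std::size_t fetched = 0;
    while (!frontier.empty() && fetched < max_pages) {
        std::string url = frontier.front();
        frontier.pop();
        std::string html = fetch_page(url);
        ++fetched;
        for (const auto& link : extract_links(html)) {
            if (seen.insert(link).second) frontier.push(link);  // save and enqueue new URLs
        }
    }
}

int main() {
    crawl({"http://example.com/"}, 10);  // the seed URL would come from the config file
    return 0;
}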
Search Engine Design and Implementation: Foreign-Literature Translation

搜索引擎设计与实现外文翻译文献(文档含英文原文和中文翻译)原文:The Hadoop Distributed File System: Architecture and DesignIntroductionThe Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed onlow-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project.Assumptions and GoalsHardware FailureHardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact t hat there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.Streaming Data AccessApplications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.Large Data SetsApplications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.Simple Coherency ModelHDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. AMap/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.“Moving Computation is Cheaper than Moving Data”A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.Portability Across Heterogeneous Hardware and Software PlatformsHDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.NameNode and DataNodesHDFS has a master/slave architecture. 
An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFSmetadata. The system is designed in such a way that user data never flows through the NameNode.The File System NamespaceHDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.Data ReplicationHDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. 
A Blockreport contains a list of all blocks on a DataNode.Replica Placement: The First Baby StepsThe placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in thesame rack is greater than network bandwidth between machines in different racks.The NameNode determines the rack id each DataNode belongs to via the process outlined in Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.The current, default replica placement policy described here is a work in progress. Replica SelectionTo minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If angg/ HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.SafemodeOn startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. ABlockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. 
After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.The Persistence of File System MetadataThe HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up. Work is in progress to support periodic checkpointing in the near future.The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create alllocal files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files and sends this report to the NameNode: this is the Blockreport.The Communication ProtocolsAll HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.RobustnessThe primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.Data Disk Failure, Heartbeats and Re-ReplicationEach DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. 
The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.Cluster RebalancingThe HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.Data IntegrityIt is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.Metadata Disk FailureThe FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.SnapshotsSnapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to apreviously known good point in time. HDFS does not currently support snapshots but will in a future release.Data OrganizationData BlocksHDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semanticson files. A typical block size used by HDFS is 64 MB. 
Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.StagingA client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impacts throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.Replication PipeliningWhen a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.AccessibilityHDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.FS ShellHDFS allows user data to be organized in the form of files and directories. It provides a commandline interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. 
Here are some sample action/command pairs: FS shell is targeted for applications that need a scripting language to interact with the stored data.
DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:
Browser Interface
A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
Space Reclamation
File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
A user can undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well-defined interface.
Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.
Translation: The Hadoop Distributed File System: Architecture and Design. I. Introduction. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
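As a concrete illustration of the default rack-aware placement rule described above (first replica on the writer's node, second on another node in the same rack, third on a node in a remote rack when the replication factor is three), the following sketch picks targets from a simple in-memory view of the cluster. The node and rack names are invented for the example, and a real implementation would also need to consider factors such as node load and free space, which are omitted here.

#include <iostream>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::string rack;
};

// Pick targets for a block with replication factor 3, following the default
// policy described in the text: first replica on the local (writer's) node,
// second on a different node in the same rack, third on a node in a remote rack.
std::vector<Node> choose_replicas(const Node& writer, const std::vector<Node>& cluster) {
    std::vector<Node> chosen{writer};
    for (const auto& n : cluster) {                 // same rack, different node
        if (n.rack == writer.rack && n.name != writer.name) { chosen.push_back(n); break; }
    }
    for (const auto& n : cluster) {                 // any node on a remote rack
        if (n.rack != writer.rack) { chosen.push_back(n); break; }
    }
    return chosen;
}

int main() {
    std::vector<Node> cluster = {
        {"dn1", "rack-a"}, {"dn2", "rack-a"}, {"dn3", "rack-b"}, {"dn4", "rack-b"}};
    for (const auto& n : choose_replicas(cluster[0], cluster))
        std::cout << n.name << " (" << n.rack << ")\n";   // dn1, dn2, dn3
    return 0;
}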
English Essays on Search Engines

An English essay about search engines opens a chapter of the modern era.
Below are the English essays on search engines compiled for your reference.
Search Engine Essay 1: In terms of the most direct user experience, search speed, Google has always been half a beat slower than the Chinese search engine Baidu. Even after Google announced its entry into China, its homepage still failed to load from time to time, which is the main reason many netizens who were once loyal Google users defected to Baidu. In personalized services such as MP3 search and Tieba (Post Bar), Baidu understands the needs of Chinese users far more deeply than Google does. From the standpoint of search technology, Google has always prided itself on its strong search techniques, and this is indeed true for English search. But Chinese, as one of the most complex languages in the world, differs greatly from English in terms of search technology, and Google's current performance in Chinese search is still unsatisfactory.
Search Engine Essay 2: Google is an American public corporation, earning revenue from advertising related to its Internet search, e-mail, online mapping, office productivity, social networking, and video sharing services, as well as selling advertising-free versions of the same technologies. The Google headquarters, the Googleplex, is located in Mountain View, California. As of December 31, 2008, the company has 20,222 full-time employees. Google was co-founded by Larry Page and Sergey Brin while they were students at Stanford University, and the company was first incorporated as a privately held company on September 4, 1998. The initial public offering took place on August 19, 2004, raising US$1.67 billion and making it worth US$23 billion. Google has continued its growth through a series of new product developments, acquisitions, and partnerships. Environmentalism, philanthropy, and positive employee relations have been important tenets during the growth of Google, the latter resulting in its being identified multiple times as Fortune Magazine's #1 Best Place to Work. The unofficial company slogan is "Don't be evil", although criticism of Google includes concerns regarding the privacy of personal information, copyright, censorship, and discontinuation of services. According to Millward Brown, it is the most powerful brand in the world.
Search Engine Essay 3: If you want to search for information on the Internet, you can use a search engine.
Web Crawlers: Graduation Thesis

Web Crawlers: A Powerful Tool for Data Mining. With the rapid development of the Internet, we have entered an era of information explosion.
Huge volumes of data pour into our lives, and how to obtain useful information from them has become an important problem.
Against this background, the web crawler emerged and has become a powerful tool for data mining.
1. Definition and Principle of Web Crawlers: a web crawler, as its name suggests, crawls across the network like a spider and automatically extracts information from web pages.
Its working principle can be summarized in a few steps: first, the crawler starts from a seed page and finds other pages by parsing the links it contains; then it visits these links recursively to fetch further pages; finally, it processes the fetched pages and extracts the required information.
2. Application Areas of Web Crawlers: web crawlers are widely used in many fields.
In the search engine field, the crawler is a core component of the engine: by fetching pages and building an index, it provides users with accurate and comprehensive search results.
In e-commerce, crawlers can be used to collect product information and help companies understand market trends and the situation of competitors.
In finance, crawlers can be used to gather data on stocks, funds, and other instruments, providing a basis for investors' decisions.
In addition, crawlers can be applied to public opinion monitoring, flight booking, real estate listings, and other areas.
3. Technical Challenges of Web Crawlers: although web crawlers are widely used in many fields, they also face some technical challenges.
First, a crawler has to cope with anti-crawling mechanisms such as CAPTCHAs and IP blocking in order to fetch data reliably.
Second, it has to handle the storage and processing of large-scale data so that the collected data can be used efficiently.
In addition, it has to deal with changes in page structure and the diversity of page content so that the required information can be extracted accurately.
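One small but necessary piece of the large-scale data handling mentioned in this section is URL normalization, so that the same page reached through superficially different URLs is fetched and stored only once. The sketch below handles only a few common cases (fragment removal, case folding of the scheme and host, an explicit default port) and is illustrative rather than complete.

#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>

// Normalize a URL so that trivially different spellings of the same address
// (upper-case scheme/host, an explicit default port, a trailing fragment)
// map to one canonical key. Only a few common cases are handled here.
std::string normalize_url(std::string url) {
    // Drop the fragment part: http://host/page#section -> http://host/page
    auto hash = url.find('#');
    if (hash != std::string::npos) url.erase(hash);

    // Lower-case the scheme and host (the path stays case-sensitive).
    auto scheme_end = url.find("://");
    if (scheme_end == std::string::npos) return url;
    auto host_end = url.find('/', scheme_end + 3);
    if (host_end == std::string::npos) host_end = url.size();
    std::transform(url.begin(), url.begin() + host_end, url.begin(),
                   [](unsigned char c) { return std::tolower(c); });

    // Remove an explicit default port: http://example.com:80/ -> http://example.com/
    const std::string default_port = ":80";
    auto port_pos = url.rfind(default_port, host_end);
    if (port_pos != std::string::npos && port_pos + default_port.size() == host_end)
        url.erase(port_pos, default_port.size());
    return url;
}

int main() {
    std::cout << normalize_url("HTTP://Example.COM:80/Index.html#top") << "\n";
    // prints: http://example.com/Index.html
    return 0;
}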
4. Ethical Issues of Web Crawlers: as web crawlers are used more and more widely, some ethical issues have gradually come to light.
First, crawlers may infringe on personal privacy; particular care must be taken to protect users' privacy when personal information is collected.
Second, crawlers may affect the normal operation of websites; for example, visiting a site too frequently may cause it to crash.
Therefore, when using web crawlers one must comply with the relevant laws, regulations, and ethical norms to ensure that crawling tools are used legally and reasonably.
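One common way to address the overload concern raised above is to enforce a minimum delay between successive requests to the same host. The sketch below is illustrative only; the one-second interval and the host name are arbitrary choices, and a production crawler would typically also honor robots.txt rules.

#include <chrono>
#include <iostream>
#include <string>
#include <thread>
#include <unordered_map>

// Enforce a minimum interval between requests to the same host so the crawler
// does not hammer any single site.
class PolitenessLimiter {
public:
    explicit PolitenessLimiter(std::chrono::milliseconds min_gap) : min_gap_(min_gap) {}

    void wait_for(const std::string& host) {
        auto now = std::chrono::steady_clock::now();
        auto it = last_.find(host);
        if (it != last_.end()) {
            auto next_allowed = it->second + min_gap_;
            if (now < next_allowed) std::this_thread::sleep_until(next_allowed);
        }
        last_[host] = std::chrono::steady_clock::now();
    }

private:
    std::chrono::milliseconds min_gap_;
    std::unordered_map<std::string, std::chrono::steady_clock::time_point> last_;
};

int main() {
    PolitenessLimiter limiter(std::chrono::milliseconds(1000));  // at most 1 request/second per host
    for (int i = 0; i < 3; ++i) {
        limiter.wait_for("example.com");       // hypothetical host
        std::cout << "request " << i << " sent\n";
    }
    return 0;
}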
5. Future Development of Web Crawlers: with the continuing development of artificial intelligence and big data technology, web crawlers will have even broader application prospects in the future.
Design and Development of a Bilingual Parallel Corpus Crawler: Graduation Thesis

Abstract: Bilingual parallel corpora are a very important resource for many tasks and are used extensively, especially in the field of natural language processing.
Within natural language processing, machine translation in particular depends heavily on parallel corpora.
The quality of machine translation depends directly on the quantity and quality of the parallel corpus.
Therefore, obtaining large amounts of bilingual parallel corpus data is an urgent problem.
This thesis proposes a complete set of techniques for acquiring bilingual corpora.
The technique aims to find multilingual websites among the vast number of sites on the Internet, fetch their pages, and extract parallel corpora.
Existing techniques can hardly discover all the multilingual websites on the Internet in a comprehensive, large-scale way and crawl parallel corpora from them.
Searching for multilingual websites at random wastes a great deal of time, much of it spent checking ordinary web pages.
For this reason, it is necessary to address the above problems with a method that discovers multilingual websites across the whole web and obtains parallel corpora: candidate multilingual website URLs are derived from open-source datasets using language tags, and open-source tools are then used to acquire parallel corpora quickly.
Although this method yields parallel corpora quickly, their quality is uneven, so a mechanism is also needed to evaluate the obtained bilingual parallel corpora and filter out the pairs of lower quality.
Keywords: bilingual parallel corpus; multilingual websites; filtering
Abstract
Bilingual parallel corpora are a very important resource for many tasks, especially in the field of Natural Language Processing (NLP). Within NLP, machine translation is heavily dependent on parallel corpora, and its quality depends directly on the quantity and quality of the parallel corpus. Therefore, obtaining a large amount of bilingual parallel corpus data is an urgent problem to be solved. This thesis proposes a complete set of techniques for bilingual corpus acquisition. The technique is intended to find multilingual websites among the many sites on the Internet, obtain their pages, and extract parallel corpora. Existing techniques can hardly discover and crawl all the multilingual websites on the Internet at large scale. Looking for multilingual sites at random wastes a great deal of time, much of it spent checking ordinary web pages. Based on this, it is necessary to provide a method for discovering multilingual websites across the whole web and obtaining parallel corpora: using open-source datasets and language-tag methods, URLs of candidate multilingual websites are obtained, and parallel corpora are then acquired quickly with open-source tools. Although this method can obtain parallel corpora quickly, the quality of the corpora varies, so it is also necessary to establish a mechanism for judging the obtained bilingual parallel corpora and screening out those of lower quality.
Keywords: Bilingual parallel corpus; Multilingual websites; filter
Preface
Bilingual parallel corpora are a very important resource for many tasks and are used extensively, especially in the field of natural language processing.
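As one concrete form of the screening mechanism mentioned in the abstract, a very simple filter can discard sentence pairs that are empty on one side or whose lengths are wildly mismatched. The 3:1 length ratio below is an arbitrary illustrative threshold, the comparison uses raw byte lengths (a simplification for UTF-8 text), and a real system would add language identification and alignment scores on top of this.

#include <iostream>
#include <string>
#include <vector>

struct SentencePair {
    std::string source;  // e.g. the English side
    std::string target;  // e.g. the Chinese side
};

// Keep a pair only if both sides are non-empty and their byte lengths are
// roughly comparable. The 3:1 ratio is an illustrative threshold, not a recommendation.
bool looks_parallel(const SentencePair& p, double max_ratio = 3.0) {
    if (p.source.empty() || p.target.empty()) return false;
    double a = static_cast<double>(p.source.size());
    double b = static_cast<double>(p.target.size());
    double ratio = a > b ? a / b : b / a;
    return ratio <= max_ratio;
}

std::vector<SentencePair> filter_corpus(const std::vector<SentencePair>& corpus) {
    std::vector<SentencePair> kept;
    for (const auto& p : corpus)
        if (looks_parallel(p)) kept.push_back(p);
    return kept;
}

int main() {
    std::vector<SentencePair> corpus = {
        {"Parallel corpora are valuable.", "平行语料非常有价值。"},
        {"", "空的一侧应当被过滤掉。"},                          // dropped: empty source side
        {"x", "这一对长度比例严重失衡,同样会被过滤掉。"}};        // dropped: length mismatch
    std::cout << filter_corpus(corpus).size() << " pairs kept\n";  // prints 1
    return 0;
}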
Search Engine Analysis: Google (Chinese-English Parallel Text)

The Anatomy of a Large-ScaleHypertextual Web Search EngineSergey Brin and Lawrence Page{sergey, page}@Computer Science Department, Stanford University, Stanford, CA 94305AbstractIn this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at /To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date.Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.Keywords: World Wide Web, Search Engines, Information Retrieval,PageRank, Google1. Introduction(Note: There are two versions of this paper -- a longer full version and a shorter printed version. The full version is available on the web and the conference CD-ROM.)The web creates new challenges for information retrieval. The amount ofinformation on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo!or with search engines. Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10100and fits well with our goal of building very large-scale search engines.1.1 Web Search Engines -- Scaling Up: 1994 - 2000Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994, one of the first web search engines, the World Wide Web Worm (WWWW) [McBryan 94] had an index of 110,000 web pages and web accessible documents. As of November, 1997, the top search engines claim to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine Watch). 
It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents. At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web Worm received an average of about 1500 queries per day. In November 1997, Altavista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.1.2. Google: Scaling with the WebCreating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing systemmust process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress such as disk seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access (see section 4.2). Further, we expect that the cost to index and store text or HTML will eventually decline relative to the amount that will be available (see Appendix B). This will result in favorable scaling properties for centralized systems like Google.1.3 Design Goals1.3.1 Improved Search QualityOur main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. According to Best of the Web 1994 -- Navigators, "The best navigation service should make it easy to find almost anything on the Web (once all the data is entered)." However, the Web of 1997 is quite different. Anyone who has used a search engine recently, can readily testify that the completeness of the index is not the only factor in the quality of search results. "Junk results" often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results). One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not. People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision (number of relevant documents returned, say in the top tens of results). Indeed, we want our notion of "relevant" to only include the very best documents since there may be tens of thousands of slightly relevant documents. 
This very high precision is important even at the expense of recall (the total number of relevant documents the system is able to return). There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure [Page 98] and link textprovide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2).1.3.2 Academic Search Engine ResearchAside from tremendous growth, the Web has also become increasingly commercial over time. In 1993, 1.5% of web servers were on .com domains. This number grew to over 60% in 1997. At the same time, search engines have migrated from the academic domain to the commercial. Up until now most search engine development has gone on at companies with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented (see Appendix A). With Google, we have a strong goal to push more development and understanding into the academic realm.Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems. For example, there are many tens of millions of searches performed every day. However, it is very difficult to get this data, mainly because it is considered commercially valuable.Our final design goal was to build an architecture that can support novel research activities on large-scale web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form. One of our main goals in designing Google was to set up an environment where other researchers can come in quickly, process large chunks of the web, and produce interesting results that would have been very difficult to produce otherwise. In the short time the system has been up, there have already been several papers using databases generated by Google, and many others are underway. Another goal we have is to set up a Spacelab-like environment where researchers or even students can propose and do interesting experiments on our large-scale web data.2. System FeaturesThe Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank and is described in detail in [Page 98]. Second, Google utilizes link to improve search results.2.1 PageRank: Bringing Order to the WebThe citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at). 
For the type of full text searches in the main Google system, PageRank also helps a great deal.2.1.1 Description of PageRank CalculationAcademic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper.2.1.2 Intuitive JustificationPageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts onanother random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the "random surfer" will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank, again see [Page 98].Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo's homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.2.2 Anchor TextThe text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. 
In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm [McBryan 94] especially because it helps search non-text information, and expands the search coverage with fewer downloaded documents. We use anchor propagation mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed. In our current crawl of 24 million pages, we had over 259 million anchors which we indexed.2.3 Other FeaturesAside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words. Third, full raw HTML of pages is available in a repository.3 Related WorkSearch research on the web has a short and concise history. The World Wide Web Worm (WWWW) [McBryan 94] was one of the first web search engines. It was subsequently followed by several other academic search engines, many of which are now public companies. Compared to the growth of the Web and the importance of search engines there are precious few documents about recent search engines [Pinkerton 94]. According to Michael Mauldin (chief scientist, Lycos Inc) [Mauldin], "the various services (including Lycos) closely guard the details of these databases". However, there has been a fair amount of work on specific features of search engines. Especially well represented is work which can get results by post-processing the results of existing commercial search engines, or produce small scale "individualized" search engines. Finally, there has been a lot of research on information retrieval systems, especially on well controlled collections. In the next two sections, we discuss some areas where this research needs to be extended to work better on the web.3.1 Information RetrievalWork in information retrieval systems goes back many years and is well developed [Witten 94]. However, most of the research on information retrieval systems is on small well controlled homogeneous collections such as collections of scientific papers or news stories on a related topic. Indeed, the primary benchmark for information retrieval, the Text Retrieval Conference [TREC 96], uses a fairly small, well controlled collection for their benchmarks. The "Very Large Corpus" benchmark is only 20GB compared to the 147GB from our crawl of 24 million web pages. Things that work well on TREC often do not produce good results on the web. For example, the standard vector space model tries to return the document that most closely approximates the query, given that both query and document are vectors defined by their word occurrence. On the web, this strategy often returns very short documents that are the query plus a few words.For example, we have seen a major search engine return a page containing only "Bill Clinton Sucks" and picture from a "Bill Clinton" query. Some argue that on the web, users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position. 
If a user issues a query like "Bill Clinton" they should get reasonable results since there is a enormous amount of high quality information available on this topic. Given examples like these, we believe that the standard information retrieval work needs to be extended to deal effectively with the web.3.2 Differences Between the Web and Well Controlled CollectionsThe web is a vast collection of completely uncontrolled heterogeneous documents. Documents on the web have extreme variation internal to the documents, and also in the external meta information that might be available. For example, documents differ internally in their language (both human and programming), vocabulary (email addresses, links, zip codes, phone numbers, product numbers), type or format (text, HTML, PDF, images, sounds), and may even be machine generated (log files or output from a database). On the other hand, we define external meta information as information that can be inferred about a document, but is not contained within it. Examples of external meta information include things like reputation of the source, update frequency, quality, popularity or usage, and citations. Not only are the possible sources of external meta information varied, but the things that are being measured vary many orders of magnitude as well. For example, compare the usage information from a major homepage, like Yahoo's which currently receives millions of page views every day with an obscure historical article which might receive one view every ten years. Clearly, these two items must be treated very differently by a search engine.Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic and companies which deliberately manipulating search engines for profit become a serious problem. This problem that has not been addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. There are even numerous companies which specialize in manipulating search engines for profit.4 System AnatomyFirst, we will provide a high level discussion of the architecture. Then, there is some in-depth descriptions of important data structures. Finally, the major applications: crawling, indexing, and searching will be examined in depth.4.1 Google ArchitectureOverviewIn this section, we will give ahigh level overview of how thewhole system works as pictured inFigure 1. Further sections willdiscuss the applications anddata structures not mentioned inthis section. Most of Google isimplemented in C or C++ forefficiency and can run in eitherSolaris or Linux. In Google, the web crawling (downloading of web pages) isdone by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. 
It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, andcapitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.Figure 1. High Level Google ArchitectureThe URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.4.2 Major Data StructuresGoogle's data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although, CPUs and bulk input output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.4.2.1 BigFilesBigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.4.2.2 RepositoryThe repository contains the fullHTML of every web page. Each pageis compressed using zlib (seeRFC1950). The choice ofcompression technique is atradeoff between speed andFigure 2. Repository Data Structure compression ratio. We chose zlib'sspeed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL as can be seen in Figure 2. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.4.2.3 Document IndexThe document index keeps information about each document. It is a fixed width ISAM (Index sequential access mode) index, ordered by docID. 
The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist which contains just the URL. This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a searchAdditionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs and is sorted by checksum. In order to find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docID. URLs may be converted into docIDs in batch by doing a merge with this file. This is the technique the URLresolver uses to turn URLs into docIDs. This batch mode of update is crucial because otherwise we must perform one seek for every link which assuming one disk would take more than a month for our 322 million link dataset.4.2.4 LexiconThe lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.4.2.5 Hit ListsA hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in Figure 3.Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else.A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited phrase searching as long as there are not that many anchors for a particular word. 
4.2.4 Lexicon

The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.

4.2.5 Hit Lists

A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in Figure 3.

Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docID hash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font.
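The two-byte hit encoding described above maps naturally onto bit fields. The sketch below packs plain and fancy hits into a 16-bit integer following the bit widths given in the text (1 capitalization bit, 3 font bits, 12 position bits for plain hits; a 4-bit type and 8-bit position for fancy hits). It is one possible reading of the prose, not Google's code, and the exact bit ordering within the word is an assumption.

```cpp
// Illustrative packing of the two-byte hits from Section 4.2.5.
// Assumed bit order within the 16-bit word: [cap:1][font:3][position:12].
#include <cstdint>

constexpr unsigned kFancyFont = 0x7;   // font field 111 signals a fancy hit

// Plain hit: capitalization bit, relative font size (0..6), 12-bit position.
uint16_t packPlainHit(bool cap, unsigned font, unsigned pos) {
    if (font > 6) font = 6;            // 7 (111) is reserved for fancy hits
    if (pos > 4095) pos = 4095;        // the paper labels such positions "4096"
    return static_cast<uint16_t>((cap ? 1u : 0u) << 15 | (font & 0x7) << 12 |
                                 (pos & 0xFFF));
}

// Fancy hit: font forced to 7, 4-bit fancy-hit type, 8-bit position.
uint16_t packFancyHit(bool cap, unsigned type, unsigned pos) {
    return static_cast<uint16_t>((cap ? 1u : 0u) << 15 | kFancyFont << 12 |
                                 (type & 0xF) << 8 | (pos & 0xFF));
}

// Anchor hit: the 8 position bits split into 4 bits of position in the anchor
// text and 4 bits of a hash of the docID the anchor occurs in.
uint16_t packAnchorHit(bool cap, unsigned type, unsigned anchorPos,
                       unsigned docIdHash4) {
    return packFancyHit(cap, type, (anchorPos & 0xF) << 4 | (docIdHash4 & 0xF));
}
```

A decoder simply checks whether the font field equals 7 to decide which of the two layouts to read.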
外文翻译--网络服务的爬虫引擎

外文资料

WSCE: A Crawler Engine for Large-Scale Discovery of Web Services

Eyhab Al-Masri and Qusay H. Mahmoud

Abstract

This paper addresses issues relating to the efficient access and discovery of Web services across multiple UDDI Business Registries (UBRs). The ability to explore Web services across multiple UBRs is becoming a challenge particularly as size and magnitude of these registries increase. As Web services proliferate, finding an appropriate Web service across one or more service registries using existing registry APIs (i.e. UDDI APIs) raises a number of concerns such as performance, efficiency, end-to-end reliability, and most importantly quality of returned results. Clients should not have to endlessly search accessible UBRs to find appropriate Web services, particularly when operating via mobile devices. Finding relevant Web services should be time effective and highly productive. In an attempt to enhance the efficiency of searching for businesses and Web services across multiple UBRs, we propose a novel exploration engine, the Web Service Crawler Engine (WSCE). WSCE is capable of crawling multiple UBRs, and enables the establishment of a centralized Web services’ repository which can be used for large-scale discovery of Web services. The paper presents experimental validation, results, and analysis of the presented ideas.

1. Introduction

The continuous growth and propagation of the internet have been some of the main factors for information overload which at many instances act as deterrents for quick and easy discovery of information. Web services are internet-based, modular applications, and the automatic discovery and composition of Web services are an emerging technology of choice for building understandable applications used for business-to-business integration and are of an immense interest to governments, businesses, as well as individuals. As Web services proliferate, the same dilemma perceived in the discovery of Web pages will become tangible and the searching for specific business applications or Web services becomes challenging and time consuming particularly as the number of UDDI Business Registries (UBRs) begins to multiply.

In addition, decentralizing UBRs adds another level of complexity on how to effectively find Web services within these distributed registries. Decentralization of UBRs is becoming tangible as new operating systems, applications, and APIs are already equipped with built-in functionalities and tools that enable organizations or businesses to publish their own internal UBRs for intranet and extranet use, such as the Enterprise UDDI Services in Windows Server 2003, WebSphere Application Server, Systinet Business Registry, and jUDDI, to name a few. Enabling businesses or organizations to self-operate and manage their own UBRs will maximize the likelihood of having a significant increase in the number of business registries and therefore, clients will soon face the challenge of finding Web services across hundreds, if not thousands of UBRs.

At the heart of the Service Oriented Architecture (SOA) is a service registry which connects and mediates service providers with clients as shown in Figure 1.
Service registries extend the concept of an application-centric Web by allowing clients (or conceivably applications) to access a wide range of Web services that match specific search criteria in an autonomous manner. Without publishing Web services through registries, clients will not be able to locate services in an efficient manner, and service providers will have to devote extra efforts in advertising their services through other channels. There are several companies that offer Web-based Web service directories such as WebServiceList [1], RemoteMethods [2], WSIndex [3], and [4]. However, due to the fact that these Web-based service directories fail to adhere to Web services’ standards such as UDDI, it is likely that they become vulnerable to being unreliable sources for finding relevant Web services, and may become disconnected from the Web services environment as in the cases of BindingPoint and SalCentral which closed their Web-based Web service directories after many years of exposure.

Apart from having Web-based service directories, there have been numerous efforts that attempted to improve the discovery of Web services [5,6,9,21], however, many of them have failed to address the issue of handling discovery operations across multiple UBRs. Due to the fact that UBRs are hosted on Web servers, they are dependent on network traffic and performance, and therefore, clients that are looking for appropriate Web services are susceptible to performance issues when carrying out multiple UBR search requests. To address the above-mentioned issues, this work introduces a framework that serves as the heart of our Web Services Repository Builder (WSRB) architecture [7] by enhancing the discovery of Web services without having any modifications to existing standards. In this paper, we propose the Web Service Crawler Engine (WSCE) which actively crawls accessible UBRs and collects business and Web service information. Our architecture enables businesses and organizations to maintain autonomous control over their UBRs while allowing clients to perform search queries adapted to large-scale discovery of Web services. Our solution has been tested and results present high performance rates when compared with other existing models.

The remainder of this paper is organized as follows. Section two discusses related work. Section three discusses some of the limitations with existing UBRs. Section four discusses the motivations for WSCE. Section five presents our Web service crawler engine’s architecture. Experiments and results are discussed in Section six, and finally conclusion and future work are discussed in Section seven.

2. Related Work

Discovery of Web services is a fundamental area of research in ubiquitous computing. Many researchers have focused on discovering Web services through a centralized UDDI registry [8,9,10]. Although centralized registries can provide effective methods for the discovery of Web services, they suffer from problems associated with having centralized systems such as single point of failure, and bottlenecks. In addition, other issues relating to the scalability of data replication, providing notifications to all subscribers when performing any system upgrades, and handling versioning of services from the same provider have driven researchers to find other alternatives. Other approaches focused on having multiple public/private registries grouped into registry federations [6,12] such as METEOR-S for enhancing the discovery process.
METEOR-S provides a discovery mechanism for publishing Web services over federated registries but this solution does not provide the means for articulating advanced search techniques which are essential for locating appropriate business applications. In addition, having federated registry environments can potentially provide inconsistent policies to be employed which will have a significant impact on the practicability of conducting inquiries across them. Furthermore, federated registry environments will have increased configuration overhead, additional processing time, and poor performance in terms of execution time when performing service discovery operations. A desirable solution would be a Web services’ crawler engine such as WSCE that can facilitate the aggregation of Web service references, resources, and description documents, and can provide clients with a standard, universal access point for discovering Web services distributed across multiple registries.

Several approaches focused on applying traditional Information Retrieval (IR) techniques or using keyword-based matching [13,14] which primarily depend on analyzing the frequency of terms. Other attempts focused on schema matching [15,16] which try to understand the meanings of the schemas and suggest any trends or patterns. Other approaches studied the use of supervised classification and unsupervised clustering of Web services [17], artificial neural networks [18], or using unsupervised matching at the operation level [19].

Other approaches focused on the peer-to-peer framework architecture for service discovery and ranking [20], providing a conceptual model based on Web service reputation [21], and providing keyword-based search engine for querying Web services [22]. However, many of these approaches provide a very limited set of search methods (i.e. search by business name, business location, etc.) and attempt to apply traditional IR techniques that may not be suitable for services’ discovery since Web services often contain or provide very brief textual description of what they offer. In addition, the Web services’ structure is complex and only a small portion of text is often provided.

WSCE enhances the process of discovering Web services by providing advanced search capabilities for locating proper business applications across one or more UDDI registries and any other searchable repositories. In addition, WSCE allows for high performance and reliable discovery mechanism while current approaches are mainly dependent on external resources which in turn can significantly impact the ability to provide accurate and meaningful results. Furthermore, current techniques do not take into consideration the ability to predict, detect, recover from failures at the Web service host, or keep track of any dynamic updates or service changes.

3. UDDI Business Registries (UBRs)

Business registries provide the foundation for the cataloging and classification of Web services and other additional components. A UDDI Business Registry (UBR) serves as a service directory for the publishing of technical information about Web services [23]. The UDDI is an initiative originally backed up by several technology companies including Microsoft, IBM, and Ariba [24] and aims at providing a focal point where all businesses, including their Web services meet together in an open and platform-independent framework.
Hundreds of other companies have endorsed the UDDI initiative including HP, Intel, Fujitsu, BEA, Oracle, SAP, Nortel Networks, WebMethods, Andersen Consulting, Sun Microsystems, to name a few. E-Business XML (ebXML) is another service registry standard that focuses more on the collaboration between businesses [27]. Although commonalities between UDDI and ebXML registries present opportunities for interoperability between them [26], the UDDI remains the de facto industry standard for Web service discovery [21]. Although the UDDI provides ways for locating businesses and how to interface with them electronically, it is limited to a single search criterion. The keyword-based search techniques offered by UDDI make it impractical to assume that it can be very useful for Web services’ discovery or composition. In addition, a client should not have to endlessly search UBRs to find an appropriate Web service. As Web services proliferate and the number of UBRs increases, limited search capabilities are likely to yield less meaningful search results, which makes the task of performing search queries across one or multiple UBRs very time consuming and less productive.

3.1. Limitations with Current UDDI

Apart from the problems regarding limited search capabilities offered by UDDI, there are other major limitations and shortcomings with the existing UDDI standard. Some of these limitations include: (1) UDDI was intended to be used only for Web services’ discovery; (2) UDDI registration is voluntary, and therefore, it risks becoming passive; (3) UDDI does not provide any guarantees to the validity and quality of information it contains; (4) the disconnection between UDDI and the current Web; (5) UDDI is incapable of providing Quality of Service (QoS) measurements for registered Web services, which can provide helpful information to clients when choosing appropriate Web services; (6) UDDI does not clearly define how service providers can advertise pricing models; and (7) UDDI does not maintain nor provide any Web service life-cycle information (i.e. Web services across stages). Other limitations with the current UDDI standard [23] are shown in Table 1.

Although the UDDI has been the de facto industry standard for Web services’ discovery, the ability to find a scalable solution for handling significant amounts of data from multiple UBRs at a large scale is becoming a critical issue. Furthermore, the search time when searching one or multiple UDDI registries (i.e. meta-discovery) raises several concerns in terms of performance, efficiency, reliability and the quality of returned results.

4. Motivations for WSCE

Web services are syntactically described using the Web Service Description Language (WSDL) which concentrates on describing Web services at the functional level. A more elaborate business-centric model for Web services is provided by the UDDI which allows businesses to create many-to-many partnership relationships and serves as a focal point where all businesses of all sizes can meet together in an open and a global framework. Although there have been numerous standards that support the description and discovery of Web services, combining these sources of information in a simple manner for clients to apprehend and use is not currently present. In order for clients to search or invoke services, first they have to manually perform search queries to an existing UBR based on a primitive keyword-based technique, loop through returned results, extract binding information (i.e.
through bindingTemplates or via WSDL access points), and manually examine their technical details. In this case, clients have to manually collect Web service information from different types of resources which may not be a reliable approach for collecting information about Web services. What is therefore desirable is a Web services’ crawler engine such as WSCE that facilitates the aggregation of Web service references, resources, and description documents and provides a well defined access pattern of usages on how to discover Web services. WSCE facilitates the establishment of a Web services’ search engine in which service providers will have enough visibility for their services, and at the same time clients will have the appropriate tools for performing advanced search queries. The crucial design of WSCE is motivated by several factors including: (1) the inability to periodically keep track of business and Web service life-cycle using existing UDDI design, which can provide extremely helpful information serving as the basis for documenting Web services across stages; (2) the inherent search criterion offered by UDDI inquiry API which would not be beneficial for finding services of interest; (3) the apparent disconnection between UBRs from the existing Web; and (4) performance issues with real-time search queries across multiple UBRs which will eventually become very time consuming as the number of UBRs increases while UDDI clients may not have the potential of searching all accessible UBRs. Other factors of motivation will become apparent as we introduce WSCE.

5. Web Service Crawler Engine (WSCE)

The Web Service Crawler Engine (WSCE) is part of the Web Services Repository Builder (WSRB) [7,11] in which it actively crawls accessible UBRs, and collects information into a centralized repository called Web Service Storage (WSS). The discovery of Web services in principle can be achieved through a number of approaches. Resources that can be used to collect Web service information may vary but all serve as an aggregate for Web service information. Prior to explaining the details of WSCE, it is important to discuss current Web service data resources that can be used for implementing WSCE.

5.1. Web Service Resources

Finding information about Web services is not strictly tied to UBRs. There are other standards that support the description, discovery of businesses, organizations, service providers, and their Web services which they make available, while interfaces that contain technical details are used for allowing the proper access to those services. For example, WSDL describes message operations, network protocols, and access points to addresses used by Web services; XML Schemas describe the grammatical XML structure sent and received by Web services; WS-Policy describes general features, requirements, and capabilities of Web services; UDDI business registries describe a more business-centric model for publishing Web services; WSDL-Semantics (WSDL-S) uses semantic annotations that define meaning of inputs, outputs, preconditions, and effects of operations described by an interface.

WSCE uses UBRs and WSDL files as data sources for collecting information into the Web Service Storage (WSS) since they contain the necessary information for describing and discovering Web services.
We investigated several methods for obtaining Web service information and findings are summarized below:

• Web-based: Web-based crawling involved using an existing search engine API to discover WSDL files across the Web such as Google SOAP Search API. Unfortunately, a considerable amount of WSDL files crawled over the Web did not contain descriptions of what these Web services have to offer. In addition, a large amount of crawled Web services contain outdated, passive, or incomplete information. About 340 Web services were collected using this method and only 23% of the collected WSDL files contained an adequate level of documentation.

• File Sharing: File sharing tools such as Kazaa and Emule provide search capability by file types. A test was performed by extracting WSDL files using these file sharing tools, and approximately 56 Web services were collected. Unfortunately, peer-to-peer file sharing platforms provide variable network performances, the amount of information being shared is partial, and availability of original sources could not be guaranteed at all times.

• UDDI Business Registries: Crawling UBRs was one of the best methods used for collecting Web service information. There are several UBRs that currently exist that were used for this method including: Microsoft, XMethods, SAP, National Biological Information Infrastructure (NBII), among others. Using this method, we were able to retrieve 1248 Web services and collect information about them.

Another possible method for collecting Web service information is to crawl Web-based service directories or search engines such as Woogle [19], WebServiceList [1], RemoteMethods [2], or others. However, due to the fact that there is no public access to these directories that contain Web service listings, we excluded this method from WSCE. In addition, the majority of the Web services listed within these directories were already available through the Web (via wsdl files) or listed in existing UBRs. Unfortunately, many of these Web-based service directories do not adhere to the majority of Web services’ standards, and therefore, it becomes impractical to use them for crawling Web services. Although the above-mentioned methods provide possible ways to collect Web service information, UBRs remain the best existing approach for WSCE. However, there are some challenges associated with collecting information about Web services. One of these challenges is the insufficient textual description associated with Web services which makes the task of indexing them very complex and tricky at many instances. To overcome this issue, WSCE uses a combination of indexing approaches by analyzing Web services based on different measures that make the crawling of Web services much more achievable.

5.2. Collection of Web Services

Collecting Web services’ data is not the key element that leads to an effective Web services’ discovery, but how this data is stored. The fact that Web services’ data is spread all over existing search engines’ databases, accessible UBRs, or file sharing platforms does not mean that clients are able to find Web services without difficulties.
However, making this Web services’ data available through a standard, universal access point that is capable of aggregating this data from various sources and enabling clients to execute search queries tailored to their requirements via a Web services’ search engine interface facilitated by a Web services’ crawler engine or WSCE is a key element for enhancing the discovery of Web services and accelerating their adoption.

Designing a crawler engine for Web services’ discovery is very complex and requires special attention particularly looking at the current Web crawler designs. When designing WSCE, it became apparent that many of the existing information retrieval models that serve as basis for Web crawlers may not be very suitable when it comes to Web services due to key differences that exist between Web services and Web pages including:

• Web pages often contain long textual information while Web services have very brief textual descriptions of what they offer or little documentation on how it can be invoked. This lack of textual information makes keyword-based searches vulnerable to returning irrelevant search results and therefore become a very primitive method for effectively discovering Web services.

• Web pages primarily contain plain text which allows search engines to take advantage of information retrieval methods such as finding document and term frequencies. However, Web services’ structure is much more complex than those of Web pages, and only a small portion of plain text is often provided either in UBRs or service interfaces.

• Web pages are built using HTML which has a predefined or known set of tags. However, Web service definitions are much more abstract. Web service interface information such as message names, operation and parameter names within Web services can vary significantly which makes the finding of any trends, relationships, or patterns within them very difficult and requires excessive domain knowledge in XML schemas and namespaces.

Applying Web crawling techniques to Web service definitions or WSDL files, and UBRs may not be efficient, and the outcome of our research was an enhanced crawler targeted for Web services. The crawler should be able to handle WSDL files, and UBR information concurrently. In addition to that, the crawler should be able to collect this information from multiple registries and store it into a centralized repository, such as the Web Service Storage (WSS).

WSS serves as a central repository where data and templates discovered by WSCE are stored. WSS represents a collection or catalogue of all business entries and related Web services as shown in Figure 2. WSS plays a major role in the discovery of Web services in many ways: first, it enables the identification of Web services through service descriptions and origination, processing specification, device orientation, binding instructions, and registry APIs; second, it allows for the query and location of Web services through classification; third, it provides the means for Web service life-cycle tracking; fourth, it provides dynamic Web service invocation and binding; fifth, it can be used to execute advanced or range-based search queries; sixth, it enables the provisioning of Web services by capturing multiple data types; and seventh, it can be used by businesses to perform Web service notifications. The WSS also takes advantage of some of the existing mechanisms that utilize context-aware information using artificial neural networks [18] for enhancing the discovery process.
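The roles listed for the WSS (identification, classification, life-cycle tracking, dynamic binding, range queries, provisioning, notifications) suggest what a stored catalogue entry has to carry. The record below is a hypothetical sketch of such an entry, not a schema defined by the paper; every field name is an assumption drawn from the prose (businessKey, serviceKey, bindingTemplates, archived WSDL versions for life-cycle tracking).

```cpp
// Hypothetical WSS catalogue entry; field names are assumptions drawn from
// the description of what WSCE collects from UBRs and WSDL files.
#include <cstdint>
#include <string>
#include <vector>

struct WsdlVersion {
    std::string url;        // where the WSDL was fetched from
    std::string document;   // archived WSDL text (old versions kept for life-cycle analysis)
    int64_t     fetchedAt;  // timestamp recorded by the ELM
};

struct WssEntry {
    std::string businessKey;                    // owning business in the source UBR
    std::string serviceKey;                     // unique service identifier
    std::string registryInquiryUrl;             // which UBR this entry was crawled from
    std::string name;
    std::string description;
    std::vector<std::string> categories;        // categoryBag-style classification
    std::vector<std::string> bindingTemplates;  // access/binding information
    std::vector<WsdlVersion> wsdlHistory;       // newest first; enables versioning support
    int64_t firstSeen = 0;                      // life-cycle tracking across crawls
    int64_t lastModified = 0;
};
```

Keeping the archived WSDL versions next to the registry metadata is what would let the Analysis Module compare bindingTemplates against the stored interface documents, as described later in Section 5.3.5.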
5.3. WSCE Architecture

WSCE is an automated program that resides on a server, roams across accessible UDDI business registries, and extracts information about all businesses and their related Web services to a local repository or WSS [11]. Our approach in implementing this conceptual discovery model shown in Figure 2 is a process-per-service design in which WSCE runs each Web service crawl as a process that is managed and handled by the WSCE’s Event and Load Manager (ELM). The crawling process starts with dispensing Web services into the WsToCrawl queue. WSCE’s Seed Ws list contains hundreds or thousands of business keys, service keys, and the corresponding UBR inquiry location.

WSCE begins with a collection of Web services and loops through taking a Web service from the WsToCrawl queue. WSCE then starts analyzing Web service information located within the registry, tModels, and any associated WSDL information through the Analysis Module. WSCE stores this information in the WSS after processing it through the Indexing Module. After completion, WSCE adds an entry of the Web service (using a unique identifier such as serviceKey) into the VisitedWs queue.
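The crawl cycle just described (take a service from WsToCrawl, analyze its registry and WSDL information, index it into the WSS, then record it in VisitedWs) can be sketched as a simple loop. This is a schematic reading of the prose under stated assumptions: the `fetchServiceInfo`, `analyze`, and `indexIntoWss` functions are hypothetical stand-ins for the Request Ws/Extract Ws components, the Analysis Module, and the Indexing Module; the paper gives no code.

```cpp
// Schematic WSCE crawl loop; component functions are hypothetical stand-ins
// for Request Ws / Extract Ws, the Analysis Module and the Indexing Module.
#include <deque>
#include <string>
#include <unordered_set>

struct SeedWs {                 // one entry of the Seed Ws list
    std::string businessKey;
    std::string serviceKey;
    std::string inquiryUrl;     // UBR inquiry location
};

struct ServiceInfo { std::string serviceKey; /* registry, tModel, WSDL data... */ };

// Stubs with assumed signatures for the components of Sections 5.3.3-5.3.6.
ServiceInfo fetchServiceInfo(const SeedWs& seed) { return {seed.serviceKey}; }  // Request Ws / Extract Ws
void analyze(ServiceInfo&) {}                                                   // Analysis Module
void indexIntoWss(const ServiceInfo&) {}                                        // Indexing Module -> WSS

void crawl(std::deque<SeedWs> wsToCrawl) {
    std::unordered_set<std::string> visitedWs;              // VisitedWs (dedup by serviceKey)
    while (!wsToCrawl.empty()) {
        SeedWs seed = wsToCrawl.front();
        wsToCrawl.pop_front();
        if (!visitedWs.insert(seed.serviceKey).second)
            continue;                                        // already fetched: discard
        ServiceInfo info = fetchServiceInfo(seed);           // query the UBR
        analyze(info);                                       // tModels + WSDL structure
        indexIntoWss(info);                                  // store in the WSS
        // In the full design the ELM would also enqueue newly discovered
        // businessKeys/serviceKeys here and checkpoint progress for recovery.
    }
}
```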
5.3.1. Building Crawler Queues: WsToCrawl

Conceptually, WSCE examines all Web services from accessible UBRs through businessKeys and serviceKeys and checks whether any new businessKeys or serviceKeys are extracted. If the businessKey or serviceKey has already been fetched, it is discarded; otherwise, it is added to the WsToCrawl queue. WSCE contains a queue of VisitedWs which includes a list of crawled Web services. In cases the crawler process fails or crashes, information is lost, and therefore, the Event and Load Manager (ELM) handles such scenarios and updates the WsToCrawl through the Extract Ws component.

5.3.2. Event and Load Manager

The Event and Load Manager (ELM) is responsible for managing queues of Web services that are to be fetched and checks out enough work from accessible UBRs. This process continues until the crawler has collected a sufficient number of Web services. ELM also keeps track of Web-based fetching through WSDL files. ELM also determines any changes that occurred within UBRs and allows the fetching of newly or recently modified information. This can provide extremely helpful information that serves as the basis for documenting services across stages for service life-cycle tracking. In addition, ELM is able to automate registry maintenance tasks such as moderation, monitoring, or enforcing service expirations. ELM uses the GetWs component to communicate with Extract Ws and WSS for parsing and storing purposes.

5.3.3. Request Ws

The Request Ws component begins the fetching process by querying the appropriate UBR for collecting basic business information such as businessKey, name, contact, description, identifiers, associated categories, and others. The Request Ws component makes calls to the Indexing and Analysis Modules for collecting business related information. This information is then handled by ELM which acts as a filter and an authorization point prior to sending any information to the local repository, or WSS. Once business information is stored, Extract Ws is used to extract business related Web services. Request Ws enables WSS to collect business information which can later be used for performing search queries that are business related.

5.3.4. Extract Ws

Extract Ws receives all serviceKeys associated with a given businessKey. Extract Ws parses Web service information such as serviceKey, service name, description, categories, and others. It also extracts a list of all associated bindingTemplates. Extract Ws parses all technical information associated with given Web services, and discovers how to access and use them. The Extract Ws depends on the Analysis Module to determine any mapping differences between bindingTemplates and information contained within WSDL files stored in WSS.

5.3.5. Analysis Module

The Analysis Module (AM) serves two main purposes: analyzing tModels within UBRs and WSDL file structures. In terms of tModels, AM extracts a list of entries that correspond to tModels associated with bindingTemplates. In addition, AM extracts information on the categoryBag which contains data about the Web service and its binding (i.e. test binding, production binding, and others). AM discovers any similarities between the analyzed tModel and other tModels of other Web services of the same or different business entities. In addition, AM examines information such as the tModelKey, name, description, overviewDoc (including any associated overview URL), identifierBag, and categoryBag for the purpose of creating a topology of relationships to similar tModels of interest.

AM analyzes the syntactical structure of WSDL files to collect information on messages, input and output parameters, attributes, values, and others. All WSDL files are stored in the WSS and ELM keeps track of file dates (i.e. file creation date). In cases an update is made to WSDL files, ELM stores the new WSDL file and archives the old one. Archiving WSDL files can become very helpful for versioning support for Web services and can also serve Web service life-cycle analysis. AM also extracts information such as the type of system used to describe the data model (i.e. any XML schemas, UBL, etc.), the location of where the service provider resides, how to bind WSDL to protocols (i.e. SOAP, HTTP POST/GET, and MIME), and groups of locations that can be used to access the service. This information can be very helpful for developers or businesses for efficiently locating appropriate business applications, enhancing the business-to-business integration process, or for service composition.

5.3.6. Indexing Module

The Indexing Module (IM) depends on information retrieval techniques for indexing textual information contained within WSDL files and UBRs. IM builds an index of keywords on documents, uses the vector space model technique proposed by Salton [25] for finding terms with high frequencies, classifies information into appropriate associations, and processes textual information for possible integration with other processes such as ranking and analysis. IM enables WSCE to provide businesses with an effective analysis tool for measuring e-commerce performance, Web service effectiveness, and other marketing tactics. IM can also serve many purposes and allows for the possible integration with external components (i.e. Semantic Web service tools) for building ontologies to extend the usefulness of WSCE.

6. Experiments and Results

Data used in this work are based on actual implementations of existing UBRs including: Microsoft, Microsoft Test, , and SAP. To compare performance of existing UBRs (in terms of execution time) to the WSRB framework, we measured the average time when performing search queries. The ratio has a direct effect on measurements since each UBR contains a different number of Web services published. Therefore, only the top 10% of the dataset matched is used. Results are
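Referring back to the Indexing Module (Section 5.3.6): it is said to build a keyword index over WSDL and registry text and to rely on Salton's vector space model, that is, term weighting by frequency. The snippet below is a minimal sketch of that idea (tokenize a service description, count term frequencies, and add postings to an inverted index keyed by serviceKey); it is not the paper's implementation, and the tokenization rule and index layout are assumptions.

```cpp
// Minimal term-frequency indexing sketch for the WSS (vector space model idea).
// Tokenization and the index layout are assumptions for illustration.
#include <cctype>
#include <map>
#include <string>
#include <utility>
#include <vector>

// term -> list of (serviceKey, term frequency) postings
using InvertedIndex = std::map<std::string, std::vector<std::pair<std::string, int>>>;

std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string cur;
    for (char c : text) {
        if (std::isalnum(static_cast<unsigned char>(c)))
            cur.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
        else if (!cur.empty()) { tokens.push_back(cur); cur.clear(); }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}

void indexService(InvertedIndex& index, const std::string& serviceKey,
                  const std::string& descriptionText) {
    std::map<std::string, int> tf;                   // term frequencies for this service
    for (const std::string& t : tokenize(descriptionText)) ++tf[t];
    for (const auto& [term, freq] : tf)
        index[term].emplace_back(serviceKey, freq);  // posting used later for ranking
}
```

Because service descriptions are short, the paper combines this textual signal with the structural analysis done by the Analysis Module rather than relying on term frequency alone.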
外文译文正文:探索搜索引擎爬虫
随着网络难以想象的急剧扩张,从Web中提取知识逐渐成为一种受欢迎的途径。
这是由于网络的便利和丰富的信息。
通常需要使用基于网络爬行的搜索引擎来找到我们需要的网页。
本文描述了搜索引擎的基本工作任务。
概述了搜索引擎与网络爬虫之间的联系。
关键词:爬行,集中爬行,网络爬虫
导言
Web 是一种驻留在连接到互联网的计算机上的服务,它允许最终用户使用标准的接口软件访问存储在这些计算机中的数据。
万维网是获取访问网络信息的宇宙,是人类知识的体现。
搜索引擎是一个计算机程序,它能够从网上搜索并扫描特定的关键字,尤其是商业服务,返回的它们发现的资料清单,抓取搜索引擎数据库的信息主要通过接收想要发表自己作品的作家的清单或者通过“网络爬虫”、“蜘蛛”或“机器人”漫游互联网捕捉他们访问过的页面的相关链接和信息。
网络爬虫是一个能够自动获取万维网的信息程序。
网页检索是一个重要的研究课题。
爬虫是软件组件,它访问网络中的树结构,按照一定的策略,搜索并收集当地库中检索对象。
本文的其余部分组织如下:第二节中,我们解释了Web爬虫背景细节。
在第3节中,我们讨论爬虫的类型,在第4节中我们将介绍网络爬虫的工作原理。
在第5节,我们搭建两个网络爬虫的先进技术。
在第6节我们讨论如何挑选更有趣的问题。
网络爬虫综述
网络爬虫几乎同网络本身一样古老。
第一个网络爬虫,即马修·格雷(Matthew Gray)的 Wanderer,写于1993年春天,大约与 NCSA Mosaic 浏览器的首次发布同时出现。
在最初的两次万维网会议上发表了许多关于网络爬虫的文章。
然而,在当时,网络比现在要小三到四个数量级,所以这些系统没有解决当今对整个网络进行一次爬行所固有的扩展性问题。
显然,所有常用的搜索引擎使用的爬网程序必须扩展到网络的实质性部分。
但是,由于搜索引擎是一项竞争性质的业务,这些抓取的设计并没有公开描述。
有两个明显的例外:谷歌(Google)爬虫和互联网档案馆(Internet Archive)爬虫。
不幸的是,文献中对这些爬虫的描述过于简略,以致无法据此复现。
原谷歌爬虫(在斯坦福大学开发的)组件包括五个功能不同的运行流程。
URL 服务器进程从一个文件中读取 URL,然后将其转发给多个爬虫进程。
每个爬虫进程运行在不同的机器上,是单线程的,使用异步 I/O 以并行方式同时从最多300个网站抓取数据。
爬虫将下载的页面传输到一个存储服务器进程,由它对网页进行压缩并存储。
然后这些页面由一个索引进程进行解析,从 HTML 页面中提取链接并将它们保存到不同的磁盘文件中。
一个 URL 解析器进程读取链接文件,将其中的相对 URL 转换为绝对 URL,并保存到磁盘文件中,供 URL 服务器读取。
通常情况下,会使用三到四台爬虫机器,所以整个系统需要四到八台机器。
在谷歌将网络爬虫转变为一个商业成果之后,在斯坦福大学仍然在进行这方面的研究。
斯坦福Web Base项目组已实施一个高性能的分布式爬虫,具有每秒可以下载50到100个文件的能力。
Cho等人又发展了文件更新频率的模型以报告爬行下载集合的增量。
互联网档案馆还利用多台计算机来检索网页。
每个爬虫进程最多被分配64个站点进行检索,并且没有任何站点会被分配给一个以上的爬虫。
每个单线程爬虫进程从磁盘读取其所分配站点的种子 URL 列表,放入各站点的队列中,然后使用异步 I/O 从这些队列中并行地抓取网页。
一旦一个页面下载完毕,爬虫提取包含在其中的链接。
如果一个链接提到它被包含在页面中的网站,它被添加到适当的站点排队;否则被记录在磁盘。
每隔一段时间,一个批处理程序会把这些记录下来的"跨站"URL 合并到各站点的种子集合中,并在此过程中过滤掉重复项。
WebFountain 爬虫程序与 Mercator 的结构有几个共同特点:它是分布式的、连续的、有礼貌的、可配置的。
不幸的是,写这篇文章,WebFountain是在其发展的早期阶段,并尚未公布其性能数据。
搜索引擎基本类型
A. 基于爬虫的搜索引擎
基于爬虫的搜索引擎自动创建自己的清单,它们由被称为"蜘蛛"的计算机程序建立,而不是通过人工挑选。
他们不是通过学术分类进行组织,而是通过计算机算法把所有的网页排列出来。
这种类型的搜索引擎往往规模巨大,常常能获取大量的信息;它允许在以前的搜索结果范围内进行复杂的搜索,使你能够逐步改进搜索结果。
这种类型的搜索引擎包含其所链接到的网页的全文。
所以人们可以通过匹配的单词找到他们想要的网页。
B. 人力页面目录
这种目录是通过人工挑选建立的,即它们依赖人工来创建列表。
他们以主题类别和科目类别做网页的分类。
人工目录永远不会包含其所链接到的网页的全文。
他们是小于大多数搜索引擎的。
C. 混合搜索引擎
混合搜索引擎不同于传统的以文本为导向的搜索引擎(如谷歌)或基于目录的搜索引擎(如雅虎目录):后两者都通过比较一组元数据来工作,其主要元数据来自网络爬虫或对全部互联网文本的分类分析,以及用户的搜索查询。
与此相反,混合搜索引擎除上述元数据外,还可以使用一个或多个其他元数据集,例如来自客户端网络的情境元数据,用以对客户端的上下文感知进行建模。
爬虫的工作原理
网络爬虫是搜索引擎必不可少的组成部分;运行一个网络爬虫是一项极具挑战性的任务。
有技术和可靠性问题,更重要的是有社会问题。
爬虫是最脆弱的应用程序,因为它涉及到交互的几百几千个Web服务器和各种域名服务器,这些都超出了系统的控制。
网页抓取的速度不仅取决于自身互联网连接的速度,也受到被抓取网站速度的影响。
特别是当需要从多个服务器抓取网站时,如果许多下载并行完成,总的爬行时间可以大大减少。
虽然有众多的网络爬虫应用程序,他们在核心内容上基本上是相同的。
以下是网络爬虫应用程序的工作过程:下载网页;解析已下载的页面并提取其中的所有链接;对提取到的每一个链接,重复上述过程。
网络爬虫可用于抓取互联网或局域网上的整个网站。
可以指定一个起始 URL,爬虫会跟随在该 HTML 页面中找到的所有链接。
这通常导致更多的链接,这之后将再次跟随,等等。
一个网站可以被视为一种树状结构:根节点是起始 URL,该根 HTML 页面中的所有链接都是根节点的直接子节点。
随后循环获得更多的链接。
一个网页服务器提供若干网址清单给爬虫。
网络爬虫开始通过解析一个指定的网页,标注该网页指向其他网站页面的超文本链接。
然后他们分析这些网页之间新的联系,等等循环。
网络爬虫软件并不像病毒或智能代理那样实际移动到互联网上各台不同的计算机中。
每个爬虫每次大概打开大约300个链接。
这是以足够快的速度检索网页进行索引所必需的。
爬虫驻留在单台机器上。
爬虫只是向互联网上的其他机器发送请求文档的 HTTP 请求,就像用户点击链接时网页浏览器所做的那样。
所有的爬虫事实上是自动化追寻链接的过程。
网页检索可视为一个队列处理的项目。
当检索器访问一个网页,它提取到其他网页的链接。
因此,爬虫置身于这些网址的一个队列的末尾,并继续爬行到下一个页面,然后它从队列前面删除。
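上面描述的"把新提取的 URL 放到队列末尾、从队列前面取出下一个要抓取的页面"的过程,本质上就是宽度优先遍历。下面给出一个最小的示意性代码片段(并非任何论文或系统的真实实现):其中 fetchPage 和 extractLinks 是为说明而假设的占位函数,实际的 HTTP 抓取与 HTML 解析需要另行实现。

```cpp
// 宽度优先抓取的示意:URL 队列 + 已访问集合。
// fetchPage / extractLinks 为假设的占位函数,仅用于说明流程。
#include <deque>
#include <string>
#include <unordered_set>
#include <vector>

std::string fetchPage(const std::string& url) { return ""; }                   // 占位:下载网页
std::vector<std::string> extractLinks(const std::string& html) { return {}; }  // 占位:解析链接

void bfsCrawl(const std::string& seedUrl, std::size_t maxPages) {
    std::deque<std::string> frontier{seedUrl};      // 待抓取队列
    std::unordered_set<std::string> visited;        // 已访问的 URL,避免重复抓取
    while (!frontier.empty() && visited.size() < maxPages) {
        std::string url = frontier.front();         // 从队列前面取出
        frontier.pop_front();
        if (!visited.insert(url).second) continue;  // 已抓取过则跳过
        std::string html = fetchPage(url);          // 下载页面
        for (const std::string& link : extractLinks(html))
            if (!visited.count(link))
                frontier.push_back(link);           // 新链接放到队列末尾
    }
}
```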
A. 资源约束
爬行会消耗资源:下载页面需要带宽,维护爬虫自身的数据结构需要内存,评价和挑选 URL 需要 CPU,而存储抓取到的文本、链接及其他持久化数据则需要磁盘空间。
B. 机器人协议
robots.txt 文件给出了将网站的一部分排除在抓取范围之外的指令。
类似地,一个简单的文本文件还可以提供关于已发布对象的新鲜度和流行度的信息。
这些信息使抓取工具能够优化其已收集数据的刷新策略以及对象替换策略。
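作为对上文机器人协议的一个简化说明,下面的代码片段演示了如何根据 robots.txt 中的 Disallow 规则判断某个路径是否允许抓取。这只是一个最小化的示意,并未实现完整的 robots.txt 语法(如 User-agent 分组、Allow 规则与通配符),解析方式也是为示例而假设的。

```cpp
// 简化的 robots.txt 检查:仅按 "Disallow: 前缀" 规则判断,忽略 User-agent 分组等细节。
#include <sstream>
#include <string>
#include <vector>

// 从 robots.txt 文本中提取 Disallow 前缀(示意性解析)。
std::vector<std::string> parseDisallow(const std::string& robotsTxt) {
    std::vector<std::string> rules;
    std::istringstream in(robotsTxt);
    std::string line;
    while (std::getline(in, line)) {
        const std::string key = "Disallow:";
        auto pos = line.find(key);
        if (pos == std::string::npos) continue;
        std::string path = line.substr(pos + key.size());
        auto b = path.find_first_not_of(" \t\r");   // 去掉首尾空白
        auto e = path.find_last_not_of(" \t\r");
        if (b != std::string::npos) rules.push_back(path.substr(b, e - b + 1));
    }
    return rules;
}

// 若 URL 路径以任一 Disallow 前缀开头,则不允许抓取。
bool allowedToCrawl(const std::string& urlPath, const std::vector<std::string>& disallow) {
    for (const std::string& prefix : disallow)
        if (!prefix.empty() && urlPath.compare(0, prefix.size(), prefix) == 0)
            return false;
    return true;
}
```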
C.元搜索引擎一个元搜索引擎是一种没有它自己的网页数据库的搜索引擎。
它发出的搜索支持其他搜索引擎所有的数据库,从所有的搜索引擎查询并为用户提供的结果。
只有少数元搜索引擎能让您深入检索那些最大、最有用的搜索引擎数据库。
它们往往返回来自较小的搜索引擎和其他免费目录的结果,而这些来源通常规模小且高度商业化。
爬行技术
A:主题爬行
通用网络爬虫会从一组给定的 URL 出发,尽可能多地收集网页。
而主题爬虫只针对某个特定主题收集文档,从而减少了网络流量和下载量。
主题爬虫的目标是有选择地寻找与预先定义的主题集合相关的网页。
主题不是通过关键字来指定的,而是通过示例文档来指定的。
主题爬虫并不是收集并索引所有可访问的 Web 文档以回答所有可能的即席查询,而是分析其爬行边界,找出最有可能与主题相关的链接,并避开 Web 中不相关的区域。
这极大地节省了硬件和网络资源,并有助于使抓取到的数据保持最新状态。
主题爬虫有三个主要组成部分:分类器,对已抓取的网页进行相关性判断,以决定是否扩展其中的链接;提取器(distiller),衡量已抓取网页的中心度,以确定访问的优先级;以及爬虫本身,其优先级控制可以动态重新配置,并受分类器和提取器的支配。
对主题爬行最关键的评价指标是收获率,即在抓取过程中相关网页被获取、不相关网页被有效过滤掉的比率;这一收获率必须足够高,否则主题爬虫会把大量时间浪费在剔除不相关的网页上,还不如直接使用普通爬虫。
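结合上面对分类器与优先级控制的描述,下面给出主题爬虫核心调度思路的一个示意片段:用分类器给出的相关性得分作为优先级,把待抓取的 URL 放入优先队列,每次取出得分最高的链接。其中 relevanceScore 是为说明而假设的占位函数,并非某个真实系统的实现。

```cpp
// 主题爬虫调度示意:按分类器给出的相关性得分排序的优先队列。
// relevanceScore 为假设的占位函数,代表文中的"分类器"。
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

struct Candidate {
    std::string url;
    double score;                               // 分类器给出的相关性得分
    bool operator<(const Candidate& o) const {  // priority_queue 默认取最大值
        return score < o.score;
    }
};

double relevanceScore(const std::string& url, const std::string& anchorText) {
    return 0.0;  // 占位:实际应由针对目标主题训练的分类器给出
}

void focusedCrawl(const std::vector<Candidate>& seeds, std::size_t budget) {
    std::priority_queue<Candidate> frontier(seeds.begin(), seeds.end());
    std::unordered_set<std::string> visited;
    while (!frontier.empty() && visited.size() < budget) {
        Candidate best = frontier.top();        // 总是先抓取最有希望相关的链接
        frontier.pop();
        if (!visited.insert(best.url).second) continue;
        // 下载 best.url 并解析出新链接 link 及其锚文本 text 后:
        //   frontier.push({link, relevanceScore(link, text)});
    }
}
```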
B:分布式爬行
对整个网络进行抓取是一项挑战,因为网络在不断增长并且是动态变化的。
随着网络规模越来越大,必须将爬行过程并行化,才能在合理的时间内完成网页下载。
对于需要快速获取大量数据的大型引擎而言,即使使用多线程,单个爬行进程也是不够的。
使用单个集中式爬虫时,所有抓取到的数据都要经过同一条物理链路;而把抓取活动分配给多个进程,有助于构建一个可扩展、易于配置且具有容错能力的系统。
拆分负载既降低了硬件要求,同时也提高了整体下载速度和可靠性。
每个任务都以完全分布式的方式执行,也就是说,不存在中央协调器。
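在没有中央协调器的分布式爬行中,一种常见做法是按主机名哈希把 URL 划分给固定数目的爬虫进程,使同一站点总是由同一个进程负责。下面是这一划分思路的最小示意(哈希函数与解析方式均为示例假设,并非文中系统的真实设计)。

```cpp
// 按主机名哈希划分 URL 的示意:同一主机的 URL 始终分配给同一个爬虫进程。
#include <cstddef>
#include <functional>
#include <string>

// 从 URL 中粗略提取主机名(示意性解析,未处理端口、用户信息等情况)。
std::string hostOf(const std::string& url) {
    auto start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;
    auto end = url.find('/', start);
    return url.substr(start, end == std::string::npos ? std::string::npos : end - start);
}

// 返回应负责该 URL 的爬虫进程编号(0 .. numCrawlers-1)。
std::size_t assignCrawler(const std::string& url, std::size_t numCrawlers) {
    return std::hash<std::string>{}(hostOf(url)) % numCrawlers;
}
```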
挑战:选择更"有趣"对象的问题
搜索引擎能够感知热门话题,因为它会收集用户的查询记录。
爬行过程会根据一些重要性度量来确定 URL 的优先级,例如与引导查询的相似度、反向链接数、PageRank,或它们的组合与变体。
最近,Najork 等人的研究表明,宽度优先搜索会最先收集到高质量的网页,并提出了 PageRank 的一种变体。
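上文提到的 PageRank 是按网页之间的链接关系迭代计算重要性的度量。下面给出幂迭代法计算 PageRank 的一个最小示意(阻尼系数取常用的 0.85,迭代次数固定,均为示例假设),仅用于说明这类度量如何为待抓取的 URL 排序。

```cpp
// PageRank 幂迭代的最小示意:links[i] 为页面 i 指向的页面编号列表。
#include <cstddef>
#include <vector>

std::vector<double> pageRank(const std::vector<std::vector<std::size_t>>& links,
                             double damping = 0.85, int iterations = 30) {
    const std::size_t n = links.size();
    std::vector<double> rank(n, 1.0 / n);
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> next(n, (1.0 - damping) / n);
        for (std::size_t i = 0; i < n; ++i) {
            if (links[i].empty()) {            // 悬挂页面:把权重均分给所有页面
                for (std::size_t j = 0; j < n; ++j) next[j] += damping * rank[i] / n;
            } else {
                double share = damping * rank[i] / links[i].size();
                for (std::size_t j : links[i]) next[j] += share;
            }
        }
        rank.swap(next);
    }
    return rank;
}
```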
然而,目前,搜索策略是无法准确选择“最佳”路径,因为他们的认识仅仅是局部的。
由于互联网上可获得的信息数量非常庞大,目前不可能实现全面的抓取与索引。
因此,必须采用剪枝策略。
主题爬行和智能爬行就是用于发现与特定主题或主题集合相关的网页的技术。
结论
在本文中,我们得出这样的结论:由于整个万维网的巨大规模以及资源可用性的限制,实现完整的网络爬行覆盖是不可能的。
通常会设置某种阈值(已访问 URL 的数量、网站树的层级深度、与主题的符合程度等),以限制在选定网站上进行抓取的过程。
此信息是在搜索引擎可用于存储/刷新最相关和最新更新的网页,从而提高检索的内容质量,同时减少陈旧的内容和缺页。
外文译文原文:Discussion on Web Crawlers of Search EngineAbstract-With the precipitous expansion of the Web,extracting knowledge from the Web is becoming gradually im portant and popular.This is due to the Web’s convenience and richness of information.To find Web pages, one typically uses search engines that are based on the Web crawling framework.This paper describes the basic task performed search engine.Overview of how the Web crawlers are related with search engine Keywords Distributed Crawling, Focused Crawling,Web CrawlersⅠ.INTRODUCTION on the Web is a service that resides on computers that are connected to the Internet and allows end users to access data that is stored on the computers using standard interface software. The World Wide Web is the universe of network-accessible information,an embodiment of human knowledge Search engine is a computer program that searches for particular keywords and returns a list of documents in which they were found,especially a commercial service that scans documents on the Internet. A search engine finds information for its database by accepting listings sent it by authors who want exposure,or by getting the information from their “Web crawlers,””spiders,” or “robots,”programs that roam the Internet storing links to and information about each page they visit Web Crawler is a program, which fetches information from the World Wide Web in an automated manner.Webcrawling is an important research issue. Crawlers are software components, which visit portions of Web trees, according to certain strategies,and collect retrieved objects in local repositories The rest of the paper is organized as: in Section 2 we explain the background details of Web crawlers.In Section 3 we discuss on types of crawler, in Section 4 we will explain the working of Web crawler. In Section 5 we cover the two advanced techniques of Web crawlers. In the Section 6 we discuss the problem of selecting more interesting pages.Ⅱ.SURVEY OF WEB CRAWLERSWeb crawlers are almost as old as the Web itself.The first crawler,Matthew Gray’s Wanderer, was written in the spring of 1993,roughly coinciding with the first release Mosaic.Several papers about Web crawling were presented at the first two World Wide Web conference.However,at the time, the Web was three to four orders of magnitude smaller than it is today,so those systems did not address the scaling problems inherent in a crawl of today’s Web Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the Web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described There are two notable exceptions:the Goole crawler and the Internet Archive crawler.Unfortunately,the descriptions of these crawlers in the literature are too terse to enable reproducibility The original Google crawler developed at Stanford consisted of fivefunctional components running in different processes. A URL server process read URLs out of a file and forwarded them to multiple crawler processes.Each crawler process ran on a different machine,was single-threaded,and used asynchronous I/O to fetch data from up to 300 Web servers in parallel. 
The crawlers transmitted downloaded pages to a single Store Server process, which compressed the pages and stored them to disk.The page were then read back from disk by an indexer process, which extracted links from HTML pages and saved them to a different disk file.A URLs resolver process read the link file, relative the URLs contained there in, and saved the absolute URLs to the disk file that was read by the URL server. Typically,three to four crawler machines were used, so the entire system required between four and eight machines Research on Web crawling continues at Stanford even after Google has been transformed into a commercial effort.The Stanford Web Base project has implemented a high performance distributed crawler,capable of downloading 50 to 100 documents per second.Cho and others have also developed models of documents update frequencies to inform the download schedule of incremental crawlers The Internet Archive also used multiple machines to crawl the Web.Each crawler process was assigned up to 64 sites to crawl, and no site was assigned to more than one crawler.Each single-threaded crawler process read a list of seed URLs for its assigned sited from disk int per-site queues,and then used asynchronous I/O to fetch pages fromthese queues in parallel. Once a page was downloaded, the crawler extracted the links contained in it.If a link referred to the site of the page it was contained in, it was added to the appropriate site queue;otherwise it was logged to disk .Periodically, a batch process merged these logged “cross-sit” URLs into the site--specific seed sets, filtering out duplicates in the process.The Web Fountain crawler shares several of Mercator’s characteristics:it is distributed,continuousthe authors use the term”incremental”,polite, and configurable.Unfortunately,as of this writing,Web Fountain is in the early stages of its development, and data about its performance is not yet available.Ⅲ.BASIC TYPESS OF SEARCH ENGINECrawler Based Search EnginesCrawler based search engines create their listings automaticallyputer programs ‘spider’ build them not by human selection. They are not organized by subject categories; a computer algorithm ranks all pages. Such kinds of search engines are huge and often retrieve a lot of information -- for complex searches it allows to search within the results of a previous search and enables you to refine search results. These types of search engines contain full text of the Web pages they link to .So one cann find pages by matching words in the pages one wants;B. Human Powered DirectoriesThese are built by human selection i.e. They depend on humans to create listings. 
They are organized into subject categories and subjects do classification of pages.Human powered directories never contain full text of the Web page they link to .They are smaller than most search engines.C.Hybrid Search EngineA hybrid search engine differs from traditional text oriented search engine such as Google or a directory-based search engine such as Yahoo in which each program operates by comparing a set of meta data, the primary corpus being the meta data derived from a Web crawler or taxonomic analysis of all internet text,and a user search query.In contrast, hybrid search engine may use these two bodies of meta data in addition to one or more sets of meta data that can, for example, include situational meta data derived from the client’s network that would model the context awareness of the client.Ⅳ.WORKING OF A WEB CRAWLERWeb crawlers are an essential component to search engines;running a Web crawler is a challenging task.There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of Web servers and various name servers, which are all beyond the control of the system. Web crawling speed is governed not only by the speed of one’s own Internet connection ,but also by thespeed of the sites that are to be crawled.Especially if one is a crawling site from multiple servers, the total crawling time can be significantly reduced,if many downloads are done in parallel.Despite the numerous applications for Web crawlers,at the core they are all fundamentally the same. Following is the process by which Web crawlers work:Download the Web page.Parse through the downloaded page and retrieve all the links.For each link retrieved,repeat the process The Web crawler can be used for crawling through a whole site on the Inter-/Intranet You specify a start-URL and the Crawler follows all links found in that HTML page.This usually leads to more links,which will be followed again, and so on.A site can be seen as a tree-structure,the root is the start-URL;all links in that root-HTML-page are direct sons of the root. Subsequent links are then sons of the previous sons A single URL Server serves lists of URLs to a number of crawlers.Web crawler starts by parsing a specified Web page,noting any hypertext links on that page that point to other Web pages.They then parse those pages for new links,and so on,recursively.Web Crawler software doesn’t actually move around to different computers on the Internet, as viruses or intelligent agents do. Each crawler keeps roughly 300 connections open at once.This is necessary to retrieve Web page at a fast enough pace. A crawler resides on a single machine. Thecrawler simply sends HTTP requests for documents to other machines on the Internet, just as a Web browser does when the user clicks on links. All the crawler really does is to automate the process of following links.Web crawling can be regarded as processing items in a queue. 
When the crawler visits a Web page,it extracts links to other Web pages.So the crawler puts these URLs at the end of a queue, and continues crawling to a URL that it removes from the front of the queue.Resource ConstraintsCrawlers consume resources: network bandwidth to download pages,memory to maintain private data structures in support of their algorithms,CUP to evaluate and select URLs,and disk storage to store the text and links of fetched pages as well as other persistent data.B.Robot ProtocolThe robot.txt file gives directives for excluding a portion of a Web site to be crawled. Analogously,a simple text file can furnish information about the freshness and popularity fo published objects.This information permits a crawler to optimize its strategy for refreshing collected data as well as replacing object policy.C.Meta Search EngineA meta-search engine is the kind of search engine that does not have its own database of Web pages.It sends search terms to the databases maintained by other search engines and gives users the result that come from all the search engines queried. Fewer meta searchers allow you to delve into the largest, most useful search engine databases. They tend to return results from smaller add/or search engines andmiscellaneous free directories, often small and highly commercial Ⅴ.CRAWLING TECHNIQUESFocused CrawlingA general purpose Web crawler gathers as many pages as it can from a particular set of URL’s.Where as a focused crawler is designed to only gather documents on a specific topic,thus reducing the amount of network traffic and downloads.The goal of the focused crawler is to selectively seek out pages that are relevant to a predefined set of topics.The topics re specified not using keywords,but using exemplary documents Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl,and avoids irrelevant regions of the Web This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. 
The focused crawler has three main components;: a classifier which makes relevance judgments on pages,crawled to decide on link expansion,a distiller which determines a measure of centrality of crawled pages to determine visit priorities, and a crawler with dynamically reconfigurable priority controls which is governed by the classifier and distiller The most crucial evaluation of focused crawling is to measure the harvest ratio, which is rate at which relevant pages are acquired and irrelevant pages are effectively filtered off from the crawl.This harvest ratio must be high ,otherwise the focusedcrawler would spend a lot of time merely eliminating irrelevant pages, and it may be better to use an ordinary crawler instead.B.Distributed CrawlingIndexing the Web is a challenge due to its growing and dynamic nature.As the size of the Web sis growing it has become imperative to parallelize the crawling process in order to finish downloading the pages in a reasonable amount of time.A single crawling process even if multithreading is used will be insufficient for large - scale engines that need to fetch large amounts of data rapidly.When a single centralized crawler is used all the fetched data passes through a single physical link.Distributing the crawling activity via multiple processes can help build a scalable, easily configurable system,which is fault tolerant system.Splitting the load decreases hardware requirements and at the same time increases the overall download speed and reliability. Each task is performed in a fully distributed fashion,that is ,no central coordinator exits.Ⅵ.PROBLEM OF SELECTING MORE “INTERESTING”A search engine is aware of hot topics because it collects user queries.The crawling process prioritizes URLs according to an importance metric such as similarityto a driving query,back-link count,Page Rank or their combinations/variations.Recently Najork et al. Showed that breadth-first search collects high-quality pages first and suggested a variant of Page Rank.However,at the moment,search strategies are unable to exactly selectthe “best” paths because their knowledge is only partial.Due to theenormous amount of information available on the Internet a total-crawlingis at the moment impossible,thus,prune strategies must be applied.Focusedcrawling and intelligent crawling,are techniques for discovering Webpages relevant to a specific topic or set of topics.CONCLUSIONIn this paper we conclude that complete web crawlingcoverage cannot be achieved, due to the vast size of the whole and toresource ually a kind of threshold is set upnumber of visited URLs, level in the website tree,compliance with a t。