A Web Information Extraction Model Based on XML and DOM Technology


Design and Implementation of an XML-Based Web Information Collection System

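Journal of Qiqihar University (Natural Science Edition), Vol. 33, No. 2, March 2017

Design and Implementation of an XML-Based Web Information Collection System

Wang Lei
(School of Computer Engineering, Bengbu University, Bengbu, Anhui 233030)

Abstract: An XML-based Web information collection system is designed. It extracts the semi-structured data in HTML pages and stores the cleaned and parsed data in a MySQL database.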

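The node information and field descriptions for pages of similar type are configured in a single XML file. This improves on the approach of keeping a separate extraction template for every page and effectively raises the efficiency and accuracy of Web information collection.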
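The idea of keeping node locations and field descriptions in one XML file per page type can be sketched as follows. This is only an illustration under stated assumptions: the rule-file layout (`rules`, `pageType`, `field`, `name`, `path`) and the sample page are hypothetical and not taken from the paper, the page is assumed to be well-formed XHTML already, and the MySQL step is left out.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RuleConfigDemo {
    // Hypothetical rule file: one <pageType> per family of similar pages,
    // one <field> per value to extract (name = target column, path = node location).
    static final String RULES =
        "<rules>"
      + "<pageType name='news'>"
      + "<field name='title' path='/html/body/div[@id=\"main\"]/h1'/>"
      + "<field name='author' path='/html/body/div[@id=\"main\"]/p[@class=\"author\"]'/>"
      + "</pageType>"
      + "</rules>";

    // Hypothetical cleaned page (well-formed, no namespace).
    static final String PAGE =
        "<html><body><div id='main'>"
      + "<h1>XML in Practice</h1>"
      + "<p class='author'>Wang Lei</p>"
      + "</div></body></html>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Applies every field rule of the given page type and returns name -> value. */
    static Map<String, String> extract(String rulesXml, String pageType, String pageXml) {
        try {
            Document rules = parse(rulesXml);
            Document page = parse(pageXml);
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> record = new LinkedHashMap<>();
            NodeList types = rules.getElementsByTagName("pageType");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                if (!pageType.equals(type.getAttribute("name"))) {
                    continue; // rules for a different family of pages
                }
                NodeList fields = type.getElementsByTagName("field");
                for (int j = 0; j < fields.getLength(); j++) {
                    Element field = (Element) fields.item(j);
                    String value = xpath.evaluate(field.getAttribute("path"), page);
                    record.put(field.getAttribute("name"), value.trim());
                }
            }
            return record;
        } catch (Exception e) {
            throw new IllegalStateException("extraction failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(RULES, "news", PAGE));
    }
}
```

Supporting a new family of pages then means adding a `pageType` element to the rule file rather than writing a new template class; the returned map corresponds to one row that a storage layer could insert into the database.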

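Experimental results show that the XML-based Web information collection system meets the requirements of information extraction.

Keywords: Web information collection; extraction rules; XML
CLC number: TP393.09  Document code: [illegible]  Article ID: 1007-984X(2017)02-0025-04

The rapid growth of the Internet has made today's society highly information-driven, and the massive volume of Web information has become a resource repository.

Collecting large amounts of valid data efficiently and within a reasonable time helps users with decision-making and selection problems.

Web information collection means accurately extracting the information users are interested in from Web pages and saving it in a more semantic, more structured form, so that users can query it and other applications can use it.

Reference [1] builds behavioral profiles of mobile users from the geographic location information in their base-station logs, and reference [2] proposes a method for extracting massive Web information based on node attributes and body text.

In the film and television industry, production teams analyze users' behavioral preferences to predict and design plots, identify the audience's favorite actors, and even forecast box-office revenue.

Different fields clearly differ in the quantity and quality of data they require, yet all have a pressing need for retrieval of online data, deep data mining, and efficient keyword queries.

Currently popular applications such as online public-opinion analysis and monitoring [3], trajectory mining of moving objects [4], and news push in Web news apps are all built on Web information collection.

1 System Design

Extensible Markup Language (XML) is a new-generation markup language designed and recommended by the W3C; it is flexible and extensible.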

Web Extraction Method Based on the DOM Tree and Domain Ontology

[...] as shown in Figure 1.
[...] a vast resource repository divided by domain. Studies indicate that the volume of Deep Web page information lies between [...]. To integrate information of the same kind, it is necessary to [...]
Keywords: automatic extraction; DOM tree; domain ontology; data region location; simple tree matching
Web Extraction Method Based on DOM Tree and Domain Ontology

A Web Data Extraction Framework Based on VTD-XML

A Framework for Extracting Information from Web Using VTD-XML's XPath

C. Subhashini, Education & Research, Infosys, Mysore, India
Dr. Arti Arya, Master of Computer Applications, PES School of Engineering, Bangalore, India

Abstract: The exponential growth of the WWW (World Wide Web) has produced a vast pool of information as well as several challenges, such as extracting potentially useful and previously unknown information from it. Many websites are built with HTML, whose unstructured layout makes it difficult to obtain effective and precise data. XML (Extensible Markup Language) offers a better basis for extracting useful knowledge from the WWW, because XML is a general-purpose specification for exchanging data over the Web. In this paper, a framework is suggested to extract data from the web. The semi-structured data in a web page is transformed into well-structured data using standard XML technologies, and a parsing technique called extended VTD-XML (Virtual Token Descriptor for XML), together with its XPath implementation, is used to extract data from the well-structured XML document.

Keywords: VTD-XML, Web Content Mining, Web Data Extraction, Web Data Mining, XML, XPath.

I. INTRODUCTION

The World Wide Web is a comprehensive information pool, but the data available on the internet is unstructured or semi-structured, and useful information needs to be extracted from it. The Web is huge and the structures of web pages are complex, so it is hard to find the essential data. This poses a great challenge: how to extract useful information from the web, most of which is semi-structured. Extracting such information is nevertheless necessary, because it can lead to better decision-making [1]. Web information is usually described in the form of HTML.
Since HTML has an unstructured layout and is not suitable for database applications [3], it is difficult to process HTML documents for data extraction. It is therefore better to take full advantage of XML for analyzing and processing web data [2]; XML also separates data structure from layout, which gives a more suitable data representation [3]. In this paper, a framework is suggested to extract useful information from the web based on XML technologies.

The components of the proposed framework are data acquisition, data preprocessing, data conversion, data integration, data extraction and data storage. The paper is organized as follows: Section 2 reviews the literature on XML-based data extraction; Section 3 covers the related technologies for web data extraction; Section 4 gives an overview of the proposed framework; Section 5 presents the conclusion and future scope.

II. LITERATURE SURVEY

Web data extraction is usually carried out on documents written in a markup language such as HTML or XML, whose markup represents their inner structure. Web data extraction methods mainly focus on either the text representation or the document tree structure, and can be divided into two categories: (i) pattern matching and (ii) structure matching [3].

In pattern matching, the document is accessed as text, and text-based approaches (pattern matching, regular expressions) are used. This accesses only individual lines of the document, not the document as a whole.

Structure matching accesses the document as a tree-like structure and uses path-based and relation-based approaches between nodes. It accesses the document as a whole, or the individual sub-tree that is of interest for extraction [3].

Much research is available on how to extract useful information from HTML pages, which are in semi-structured format.

Yan Hu et al. [16] proposed a generic XML-based Web information extraction solution built on two key technologies. An XML-based Web data conversion technique converts HTML into an XHTML document according to XML grammar and builds the XML DOM tree; a DOM-based XPath [18] generation algorithm then produces the XPath expression for the desired information nodes when the user marks the information points. XSLT template rules are then applied to extract the information of interest from the XHTML documents, and the extraction results are expressed in XML.

Hanyang Luo et al. [17] proposed a wrapper based on the XBRL (eXtensible Business Reporting Language) GL taxonomy to extract financial data from the web. The user extracts information with XPath extraction rules, and the collected information is then tagged using XBRL to generate an XBRL instance document that enables further data mining.

Siti Z.Z. Abidin et al. [4] proposed a prototype tool to extract and classify unstructured data. Its architecture has six components: Web, a collection of web pages; Generator, which requests web services from the target web and retrieves data from it; User, who specifies input data to the generator and classifies the results of the extraction; Converter, which converts data from the generator to an XML document or from XML documents to a multimedia database; XML document, which acts as structured storage for data classification; and Multimedia database, which stores different types of data such as text, audio and video.

Cheng Zheng et al. [2] used XML technology to convert HTML pages into XML through XSL transformations [20]. XSL Transformations (XSLT) is a language for transforming XML documents into other documents, such as HTML or XHTML.
These converted XML documents are then integrated, and the data extracted from the integrated XML documents is stored in the database through Virtual Token Descriptor (VTD).

Jussi Myllymaki [5] proposed ANDES (A Nifty Data Extraction System), a crawler-based web data extraction framework. It uses XML technologies such as XHTML and XSLT for data extraction and also provides deep web access. It extracts data from the targeted web page as well as from navigational web pages with the help of manual navigation and extraction rules.

Jussi Myllymaki et al. [6] described a methodology for creating XPath expressions to extract data from virtually any HTML page. They also specified categories of extraction rules based on their dependence on content, structural or formatting features.

In this paper, a new extended edition of VTD-XML, combined with a 64-bit JVM (Java Virtual Machine) and supporting XPath-based XML processing, is proposed for extracting data from XML documents. This extended edition of VTD-XML makes it possible to process very large XML documents (up to 256 GB in size) [10].

III. RELATED TECHNOLOGIES

A. Tidy

Tidy is an HTML syntax checker and pretty printer [14]. It is freely available; it corrects common mistakes in HTML documents and produces equivalent documents that are well-formed. Tidy can also render these documents in XHTML (Extensible Hypertext Markup Language), a reformulation of HTML as an XML application [13].

B. XSLT, XPath and XML

XSLT (Extensible Stylesheet Language Transformations) provides the mechanism for converting one data structure into another. This is achieved by applying an XSLT style sheet to the XML document. The style sheet specifies the conversion rules for accessing and transforming the input XML document into a different output format.
An XSLT processor applies the rules defined in the style sheet to the input XML document [15].

The XML Path Language (XPath) is a standard for writing expressions that locate specific pieces of information within an XML document [15]. XPath uses path expressions to navigate XML documents [18].

XML is a standard way of exchanging data over the Internet. It was chosen because it is the most widely adopted technology for information representation and exchange on the WWW [7].

C. XML Parsing Theory Based on the VTD-XML Model

VTD-XML is an open source XML processing model. It centers on a "non-extractive" XML processing technique called Virtual Token Descriptor [8]. "Non-extractive" means that the original XML document is read into memory unchanged as a byte array; the parser analyzes the position of every element in this array and records information about it, and subsequent traversal operates on these records. This contrasts with "extractive" parsing, as in DOM, SAX and other older XML processing models, which extracts parts of the original document and creates objects in memory from them. DOM builds its document structure out of such objects [9].

VTD-XML parses the XML document and creates a 64-bit binary VTD record (token) for each token. Through the list of VTD records, the application may access any desired element. VTD-XML provides higher performance and requires fewer resources than DOM: it needs memory of only about 1.3 to 1.5 times the size of the original XML document, compared with 5 to 10 times for DOM [11]. VTD-XML also provides random access and performs rapid parsing and traversal [9].
The extended edition of VTD-XML, combined with a 64-bit JVM (Java Virtual Machine), supports XPath-based XML processing for extracting data from large XML documents (up to 256 GB in size) [10].

IV. PROPOSED FRAMEWORK

The algorithm used in the proposed framework is as follows.

WDE-XML (Web Data Extraction based on XML):
Step 1: Data Acquisition: Identify the data source.
Step 2: Data Preprocessing: Map it to XHTML through the Tidy tool.
Step 3: Data Conversion: Find the reference points within the XHTML document and map the data to XML through XSLT.
Step 4: Data Integration: Merge the results through XSLT.
Step 5: Data Extraction: Extract the data using VTD-XML and its XPath implementation.

In this paper, an implementation is provided only for the last step, using a sample XML file. The proposed framework is depicted in Fig. 1.

Figure 1. An overview of the proposed framework

A. Data Acquisition

The first step is to obtain the web page for data extraction. The data source can be data on a local disk or data on the network [2].

B. Data Preprocessing

A web page in HTML format is often not well-formed: HTML parsers tolerate errors, so omitting a closing tag, for example, produces no error message. The page must therefore first be converted into well-formed XHTML. Tidy is used to repair the broken syntax and produce well-formed XHTML.

C. Data Conversion

XSL (eXtensible Stylesheet Language) is used to convert the XHTML to XML. The conversion is needed because XHTML documents are poorly structured as data: although XHTML is based on XML syntax, it still contains a lot of HTML vocabulary. An XSL file therefore has to be designed to convert the XHTML to XML. To extract information, it is then necessary to find the reference point that contains the actual content.

D. Data Integration

Data integration allows users to operate on the data effectively. A web site may contain several pages and hyperlinks, so a merging method based on XSLT for several XML documents has to be designed.
First, a merge document is created whose sub-elements state the names of the XML documents to be merged. Next, a style sheet is defined for the corresponding XML documents and applied to the merge document.

E. Data Extraction with Extended VTD-XML and XPath

The data extraction step extracts useful data from the integrated XML document. Extended VTD-XML and XPath are used for analyzing and processing the XML document because, according to [20], VTD-XML is the fastest XML parser and the only one that allows XPath to process XML documents of up to 256 GB. Using XPath, an application can bind only the relevant data items, which avoids wasteful object creation. XPath-based code is also easy to understand and simple to write and debug [12]. XPath, however, can only be applied to a parsed representation of the XML, so VTD-XML is used to build a tree-like table.

The top-level Java API for VTD-XML consists of three components:
∙ VTDGen (VTD generator) encapsulates the parsing routine that produces the internal parsed representation of XML.
∙ VTDNav (VTD navigator) is a cursor-based API that allows DOM-like random access to the hierarchical structure of XML.
∙ AutoPilot is the class that allows document-order element traversal [19].

To use extended VTD-XML and XPath, the application needs to import com.ximpleware.extended and run on a 64-bit JVM to take full advantage of extended VTD. The code for VTD-XML's XPath implementation is shown in Fig. 2.

    import com.ximpleware.extended.*;

    public class Xpath {
        public static void main(String[] args) throws Exception {
            VTDGenHuge vg = new VTDGenHuge();
            // Parse the file with namespace awareness on, using a memory-mapped buffer.
            if (vg.parseFile("D:/Sample/input.xml", true, VTDGenHuge.MEM_MAPPED)) {
                VTDNavHuge vnh = vg.getNav();
                AutoPilotHuge aph = new AutoPilotHuge(vnh);
                // Select the text node of every employee name.
                aph.selectXPath("/Employees/Employee/Empname/text()");
                int i;
                System.out.println("Employee Name");
                System.out.println("=============");
                // evalXPath() returns the index of the next matching token, or -1 when done.
                while ((i = aph.evalXPath()) != -1) {
                    System.out.println(vnh.toString(i));
                }
            }
        }
    }

Figure 2. Code for VTD-XML's XPath implementation

V. CONCLUSION AND FUTURE SCOPE

This paper explains how to extract data from the World Wide Web, the largest source of information and a huge source of unstructured data. Web pages in HTML format are converted into XHTML using Tidy and then processed with XSLT to form well-formed XML documents. The XML documents are then integrated with an XSLT-based merge method. Finally, extended VTD-XML and XPath are applied to the integrated XML document to extract useful data. The extended edition of VTD-XML, running on a 64-bit JVM with XPath-based XML processing, can handle huge XML documents of up to 256 GB. Future work includes systematically generating the XPath expression when the user marks a node of interest, extracting the attributes present in the XML document, and storing the extracted data in databases, where it may be analyzed for decision-making purposes. This is ongoing research, and the technique can be used on actual web pages.

REFERENCES

[1] Li L., Rong Q., "Research of Web Mining Technology Based on XML," International Conference on Networks Security, Wireless Communications and Trusted Computing, 2009, pp. 653-656.
[2] Cheng Z., Yong F.Y.S., "The Implementation of the Web Mining Based on XML Technology," International Conference on Computational Intelligence and Security, 2009, pp. 84-87.
[3] Rudy A.G. Gultom, "Implementing Web Data Extraction and Making Mashup with Xtractorz."
[4] Siti Z.Z. Abidin, "Extraction and Classification of Unstructured Data in WebPages for Structured Multimedia Database via XML."
[5] Jussi Myllymaki, "Effective Web Data Extraction with Standard XML Technologies."
[6] Jussi Myllymaki, "Robust Web Data Extraction with XML Path Expressions."
[7] Yasser K., Katsuhiko G., Web Mining Applications and Techniques, Tokyo Institute of Technology, XML Semantics, pp. 169-188.
[8] VTD-XML: XML Processing for the Future (Part I), /KB/cs/vtd-xml_examples.aspx
[9] Lan X., Su J., Cai J., "VTD-XML-based Design and Implementation of GML Parsing Project."
[10] VTD-XML, /wiki/VTD-XML
[11] Chee C., Faisal M.Y., Azhar K.M., "RBStreX: Hardware XML Parser for Embedded System."
[12] Schemaless C#-XML Data Binding with VTD-XML, /KB/XML/SCHEMALESS_BINDING.ASPX
[13] Web-Based Data Mining, /DEVELOPERWORKS/LIBRARY/WA-WBDM/
[14] JTidy: Open Source Software Written in Java, /OPENSOURCE/OPENSOURCESOFTWARE.PHP?ID=407
[15] XML and Web Services Unleashed. Pearson Education.
[16] Yan H., Yanyan X., "Research on Web Information Extraction Based on XML," Second Intl. Conf. on Genetic and Evolutionary Computing, Sept. 2008, pp. 401-404.
[17] Hanyang L., Jinling G., "Web Data Extraction Based on XBRL-GL Taxonomy," Proc. of 2009 Asia-Pacific Conference on Information Processing, 2009, vol. 1, pp. 358-361.
[18] /XPath/xpath_intro.asp
[19] /xml/Article/22219/1954
[20] /
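Steps 3 and 5 of the WDE-XML algorithm in the paper above (XSLT conversion, then XPath extraction) can be tried out with the XML APIs that ship with the JDK. The sketch below is not the authors' implementation: it swaps VTD-XML for the JDK's DOM-based XPath engine, and the XHTML fragment, the style sheet and the employee rows are made-up sample data. It also assumes the XHTML carries no namespace; real Tidy output declares the XHTML namespace, so the style sheet would then need a namespace prefix in its match patterns.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WdeXmlSketch {
    // Assumed result of Step 2: a well-formed XHTML table (sample data, no namespace).
    static final String XHTML =
        "<html><body><table>"
      + "<tr><td>Asha</td><td>Mysore</td></tr>"
      + "<tr><td>Ravi</td><td>Bangalore</td></tr>"
      + "</table></body></html>";

    // Hypothetical style sheet for Step 3: every table row becomes an <Employee>.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<Employees><xsl:for-each select='//tr'>"
      + "<Employee><Empname><xsl:value-of select='td[1]'/></Empname>"
      + "<City><xsl:value-of select='td[2]'/></City></Employee>"
      + "</xsl:for-each></Employees>"
      + "</xsl:template></xsl:stylesheet>";

    /** Step 3: convert presentation-oriented XHTML into data-oriented XML. */
    static String convert(String xhtml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("conversion failed", e);
        }
    }

    /** Step 5 (DOM-based stand-in): collect the text of every node the XPath selects. */
    static List<String> extract(String xml, String xpathExpr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpathExpr, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("extraction failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = convert(XHTML, XSL);
        System.out.println(extract(xml, "/Employees/Employee/Empname/text()"));
    }
}
```

The XPath expression is the same one used in Fig. 2, so the two versions can be compared directly. The DOM-based version holds the whole tree as objects in memory, which is the overhead the paper's non-extractive approach is meant to avoid for very large inputs.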

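Research on Web Information Extraction Technology

Web information extraction is an important research area in the current development of the Internet.

In today's era of artificial intelligence and big data, information extraction has become an important means of acquiring and processing information.

Among the many information extraction techniques, Web information extraction occupies a particularly important position.

This article is organized around that topic.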

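I. Introduction to Web Information Extraction

Web information extraction is an automated information-processing technique. Using web crawlers, HTML parsing, information extraction and related methods, it converts unstructured information on the Web into structured information, so that key information can be extracted, analyzed and applied.

Its applications span many fields, such as search engines, e-commerce and social network analysis.

Web information extraction is not a single monolithic technology but a collection of cooperating modules.

The crawler module fetches Web pages, the HTML parsing module parses their HTML code, and the information extraction module extracts the target information and analyzes it.

Working together, these modules extract and analyze the information in Web pages.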

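II. Applications of Web Information Extraction

Web information extraction is widely applied across many fields. Some common scenarios are:

1. Search engines

Search engines are one of the most common applications of Web information extraction. At its core, a search engine extracts and analyzes the information in Web pages so that it can match and retrieve pages by keyword.

2. E-commerce

E-commerce makes very wide use of Web information extraction. Extracting and analyzing product information from e-commerce sites enables product classification, recommendation and similar functions, which improves the user experience of those sites.

3. Social network analysis

Social network analysis has developed rapidly in recent years, and Web information extraction plays an important role there as well. Extracting and analyzing the information of users on social networks enables user clustering, community detection and similar functions.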

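III. Challenges of Web Information Extraction

Because its applications are broad and complex, Web information extraction faces a number of challenges in practice:

1. Diversity of Web page structure

The structure of Web pages is highly complex. Some pages contain multiple nested tables, DIVs and other elements whose hierarchy and structure vary widely, so Web information extraction techniques must be able to adapt to all kinds of page structure.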

Web Page Information Extraction Based on an Extended DOM Tree

Wang Lei, Jiang Jianzhong, Guo Junli
(Department of Communication Engineering, PLA Information Engineering University, Zhengzhou, Henan 450002)

Abstract: With the development of the Internet, the amount of information provided by Web pages keeps growing and its density keeps increasing. Most Web pages contain multiple information blocks that are compactly laid out and share similar patterns in HTML syntax. For Web pages containing multiple information blocks, an information extraction method is proposed: first, an extended DOM (Document Object Model) tree is created and the page is extracted into discrete information items; then, based on the hierarchical structure of the extended DOM tree and combined with the necessary visual features and semantic information, [...]

The Technology of Information Extraction for Web Resources

Guo Zhihong
(Information Research Institute, Shanghai Jiaotong University, Shanghai 200030)

Abstract: Web resources contain plenty of useful information, but because it is not well structured it cannot be used by traditional database query systems. How to extract this information from Web resources and transform it into structured information that other information integration systems can use has become a research focus in the field. This paper presents a simple Web information extraction model, discusses wrapper induction techniques based on the model, and describes a prototype system for automatic wrapper generation.

Keywords: information extraction; wrapper induction; automatic generation; prototype system

Introduction

The Internet is a huge repository of information resources holding all kinds of online information: weather forecasts, stock prices, product catalogs, government regulations and tax policies, personal hobbies, research reports and so on.

An Example of Extracting XML Data with the DOM

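[...] in data storage systems, complex files, and files with relatively loose structure, XML remains highly popular. Storing data as XML makes cross-platform conversion easy, and XML is also the preferred way to transfer data between databases. In addition, where the data structure is relatively unfixed, XML is an ideal storage format. There are many ways to retrieve XML data, but in most cases processing XML files with the DOM (Document Object Model) is a very good choice. In general, processing an XML document with the DOM involves the following steps:

1 Create a DOM instance

A DOM instance can usually be created as follows: [...]

3 Create the element node list of the XML document

Because the element node list of the XML document is kept for later use, first define a variable, then use the DOM getElementsByTagName() method to build the list of elements of the loaded XML document and store it in that variable:

    Dim [vn]
    Set [vn] = getElementsByTagName("element")

Note: [vn] is a variable name of your own choosing; "element" refers to [...] of the loaded XML document.

    [...] childNodes(j).text %>
    [...] end if
    next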
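The steps this article lists (create a DOM instance, load the document, build an element node list with getElementsByTagName, then read the text of the child nodes) can be sketched in Java with the JDK's DOM API. The article's own listing is a server-side script and is not preserved intact, so the code below is an analogue rather than a reconstruction; the `books` document and the `childTexts` helper are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomExtractDemo {
    // Hypothetical sample document.
    static final String XML =
        "<books>"
      + "<book><title>XML Basics</title><price>30</price></book>"
      + "<book><title>DOM Guide</title><price>45</price></book>"
      + "</books>";

    /** Returns "childName=text" for every element child of every <tagName> element. */
    static List<String> childTexts(String xml, String tagName) {
        try {
            // Step 1: create the DOM instance (a parser that builds DOM trees).
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Step 2: load the XML document.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Step 3: build the element node list and keep it in a variable.
            NodeList elements = doc.getElementsByTagName(tagName);
            // Step 4: walk each element's child nodes and read their text.
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < elements.getLength(); i++) {
                NodeList children = elements.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        texts.add(child.getNodeName() + "=" + child.getTextContent());
                    }
                }
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("DOM processing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(childTexts(XML, "book"));
    }
}
```

The nested loop mirrors the `childNodes(j)` access visible in the surviving fragment of the original listing; the element-type check skips whitespace text nodes, which a pretty-printed document would otherwise contribute.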

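DOM-Based Web Information Extraction

Cui Jixin; Zhang Peng; Yang Wenzhu
[Journal] Journal of Hebei Agricultural University (河北农业大学学报)
[Year (Volume), Issue] 2005 (28) 3
[Abstract] The huge volume, dynamic nature and irregularity of Web information make Web information querying and Web information integration very difficult. To address this, information extraction from Web documents in HTML format is studied and a DOM-based Web information extraction method is proposed. The method generates extraction rules based on DOM paths through added semantics and sample learning, and performs extraction by traversing the DOM tree. It can be used for Web queries and for building wrappers in information integration systems.
[Pages] 4 (pp. 90-93)
[Authors] Cui Jixin; Zhang Peng; Yang Wenzhu
[Affiliations] Hebei Institute of Engineering, Handan, Hebei 056000; Hebei Institute of Engineering, Handan, Hebei 056000; College of Mathematics and Computer Science, Hebei University, Baoding, Hebei 071002
[Language] Chinese
[CLC number] TP391.1
[Related literature]
1. A Web Information Extractor Combining Ontology and DOM [J], Liu Jiagang; Chen Shan; He Lingya
2. An Adaptive Web Information Extraction Method Based on Feature Pre-classification of a Single DOM Tree [J], Peng Yanbing; Xie Xinting
3. Adaptive Web Information Extraction Based on the DOM Tree [J], Li Chao; Peng Hong; Ye Sunan; Zhang Huan; Yang Qinyao
4. A DOM-Based Web Information Extraction Method [J], Deng Zhen
5. A Web Information Extraction Method Based on Time-Frequency Weighted DOM [J], Ma Ruimin; Qian Hao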

XML-Based Extraction of Semi-Structured Web Information

Keywords: [...] data cleaning; XML; Tidy
Document code: A

Semi-Structured Web Information Extraction Process Based on XML

Information extraction is a new field that has developed over the past decade or so. It originated in text understanding and is an important subfield of natural language processing. Web information extraction carries on from traditional information extraction [...]

Fig. 1 Information extraction and data processing flow

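DOM-Based Extraction of Web Page Topic Information

Liu Jun; Zhang Jing
[Journal] Computer Applications and Software (计算机应用与软件)
[Year (Volume), Issue] 2010 (027) 005
[Abstract] With the development of the Internet, the amount of information in Web pages keeps growing and its density keeps increasing. The topic information of a Web page, however, is usually not clearly delimited, which makes it hard to extract. To address this problem, an algorithm is proposed: build a Document Object Model (DOM) tree; then, to compensate for the weakness of HTML's semi-structured features, add display and semantic attributes (link count, non-link text length, height, width) to the DOM; partition the page into blocks with a proposed clustering rule; and finally prune the tree, deleting useless information and keeping the topic information. Experiments show that the method extracts topic information accurately.
[Pages] 3 (pp. 188-190)
[Authors] Liu Jun; Zhang Jing
[Affiliations] School of Computer Science and Technology, Wuhan University of Technology, Wuhan, Hubei 430070; School of Computer Science and Technology, Wuhan University of Technology, Wuhan, Hubei 430070
[Language] Chinese
[Related literature]
1. A DOM-Based Information Extraction Method for Dynamic Web Pages [J], Wang Pinggen
2. Web Information Extraction Based on DOM and Page Templates [J], Wang Li; Tang Jianxiong
3. Automatic Web Page Information Extraction Based on the DOM Tree and Visual Features [J], Huang Wuguan; Zhu Ming; Yin Wenke
4. A DOM-Based Information Extraction Algorithm for Semi-Structured Web Pages [J], Li Weidong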
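The annotate-then-prune idea in the abstract above can be illustrated with a small sketch. It is an assumption-laden simplification, not the paper's algorithm: the clustering rule is not given in the abstract, so each direct child of `body` is treated as one block; only two of the named attributes (link count and non-link text length) are computed; and blocks are pruned with a plain link-text ratio threshold. The attribute names, the threshold and the sample page are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TopicPruneDemo {
    // Hypothetical cleaned page: navigation, main content, footer.
    static final String PAGE =
        "<html><body>"
      + "<div id='nav'><a href='/a'>Home</a><a href='/b'>News</a><a href='/c'>Sports</a></div>"
      + "<div id='content'><p>The council approved the <a href='/plan'>transit plan</a>"
      + " on Monday after a long debate.</p></div>"
      + "<div id='footer'><a href='/about'>About</a> | <a href='/contact'>Contact</a></div>"
      + "</body></html>";

    /** Number of text characters that sit inside <a> elements under the node. */
    static int linkTextLength(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE && "a".equals(node.getNodeName())) {
            return node.getTextContent().trim().length();
        }
        int total = 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            total += linkTextLength(c);
        }
        return total;
    }

    /** Annotates each block with link statistics, then keeps blocks that are mostly non-link text. */
    static String extractTopic(String xhtml, double maxLinkRatio) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Element body = (Element) doc.getElementsByTagName("body").item(0);
            List<String> kept = new ArrayList<>();
            for (Node n = body.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                Element block = (Element) n;
                String text = block.getTextContent().trim();
                int linkChars = linkTextLength(block);
                // Store the semantic attributes on the DOM node itself.
                block.setAttribute("linkCount",
                        String.valueOf(block.getElementsByTagName("a").getLength()));
                block.setAttribute("nonLinkChars", String.valueOf(text.length() - linkChars));
                // Prune blocks whose text is dominated by links (menus, footers).
                double ratio = text.isEmpty() ? 1.0 : (double) linkChars / text.length();
                if (ratio <= maxLinkRatio) {
                    kept.add(text);
                }
            }
            return String.join("\n", kept);
        } catch (Exception e) {
            throw new IllegalStateException("pruning failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractTopic(PAGE, 0.5));
    }
}
```

With the threshold at 0.5, the navigation and footer blocks are dropped because nearly all of their text is link text, while the content block survives despite containing one inline link. The display attributes (height, width) mentioned in the abstract would need a rendering engine and are outside this sketch.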


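Journal of Dalian Jiaotong University, Vol. 34, No. 3, Jun. 2013
Article ID: 1673-9590(2013)03-0096-05

A Web Information Extraction Model Based on XML and DOM Technology

Li Wen, Zheng Bangxi, Deng Wu
(Software Institute, Dalian Jiaotong University, Dalian, Liaoning 116028)

Abstract: XML technology is applied to search engines, and a Web information extraction model based on XML and DOM technology is proposed. The four stages of the model (data collection, page optimization, extraction rule generation and information extraction) are analyzed in detail, and the use of Web crawlers, NekoHTML, Xerces-J, JTree, XPath and XSLT in Web information extraction is discussed. The model achieves semi-automatic Web information extraction.
Keywords: information extraction; XML technology; DOM technology; Web page
Document code: A

Received: 2012-09-18
Funding: Open Fund of the State Key Laboratory of Software Engineering, Wuhan University (SKLSE2012-09-27); Sichuan Provincial Key Laboratory Fund (GK201202); Fund of the Guangxi Key Laboratory of Hybrid Computation and IC Design Analysis
About the author: Deng Wu (b. 1976), male, associate professor, Ph.D., whose research covers artificial intelligence and information extraction
E-mail: dw7689@Ct..edu.cn

0 Introduction

We live in an age of information explosion. The Internet is flooded with massive amounts of disorderly data, and finding the information one needs in such a vast repository is like looking for a needle in a haystack, costing both time and effort. Search engine technology was proposed in response: entering keywords returns links to pages with related information. It relieves much of the burden of searching for information, but because the search engine model is simple and its precision is low, it cannot fully satisfy the need to find information quickly.

Most Web pages are written in semi-structured HTML and are highly heterogeneous, which makes them hard for computers to generalize and analyze. Information extraction (IE) technology, which emerged in the 1980s, was brought to the Web against this background. Researchers in China and abroad have studied Web information extraction extensively and proposed many methods [...]. By degree of automation, these methods are manual, semi-automatic or fully automatic; by principle, they are based on wrapper induction, natural language understanding, ontology, or HTML structure. Wang Qi et al. proposed an automatic Web information extraction method based on the semantic DOM tree: it constructs an STU-DOM from the page information and removes noise nodes that are inconsistent with the page topic by comparing topic relevance, but it lacks good general [...]

[...] proposes a Web information extraction method based on XML technology. The method uses current HTML and XML parsing techniques to parse semi-structured Web pages obtained from the Internet into XHTML documents, which resemble XML, and combines DOM, XPath, XSLT and JTree to build the Web information extraction module. The result is a simple, general Web information extraction model that extracts Web information semi-automatically.

1 Web Information Extraction Model

The main idea of Web information extraction based on XML and DOM technology is as follows. A crawler downloads Web pages on the relevant topic from the Internet as sample pages. An HTML parser cleans up irregular tags and non-topic elements so that each page becomes an XHTML document conforming to the XML specification. An XML parser then parses the document to obtain the corresponding XML DOM tree. Java's JTree component displays the document so that the user can conveniently mark the nodes of interest; an XPath path algorithm derives the XPath of each marked node, which is combined with the transformation language XSLT to produce the extraction rule template. After repeated experiments, the resulting rules are added to the extraction rule library and used to extract from the pages users request. The XML-based Web information extraction model is shown in Figure 1.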
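The step in which an XPath is derived for a node the user has marked can be sketched as below. The paper does not spell out its path algorithm, so this is one plausible version under stated assumptions: it walks from the marked node up to the root and writes a position index for every step. The sample pages are invented, and marking a node in a JTree is simulated by picking the node directly from the DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathGenDemo {
    // Hypothetical sample page and a second page built from the same template.
    static final String SAMPLE =
        "<html><body><div><p>ad</p></div>"
      + "<div><h1>Title A</h1><p>Summary A</p><p>Body A</p></div></body></html>";
    static final String OTHER =
        "<html><body><div><p>ad</p></div>"
      + "<div><h1>Title B</h1><p>Summary B</p><p>Body B</p></div></body></html>";

    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    /** Builds an absolute, position-indexed XPath for an element node. */
    static String xpathOf(Node node) {
        StringBuilder path = new StringBuilder();
        for (Node n = node; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            // Position among preceding siblings that share the same element name.
            int index = 1;
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    /** Applies a stored rule to another page and returns the selected text. */
    static String apply(String rule, String xhtml) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(rule, parse(xhtml));
        } catch (Exception e) {
            throw new IllegalStateException("rule failed: " + rule, e);
        }
    }

    public static void main(String[] args) {
        // Simulate the user marking the third <p> in document order on the sample page.
        Node marked = parse(SAMPLE).getElementsByTagName("p").item(2);
        String rule = xpathOf(marked);
        System.out.println(rule);
        System.out.println(apply(rule, OTHER));
    }
}
```

A purely positional path like this only transfers to pages that share the sample's layout; anchoring steps on stable attributes such as `id` or `class` instead of positions is a common way to make such rules less brittle.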