Beautiful Soup Chinese Documentation


Parsing All Tables with Beautiful Soup

A Deep Dive into Beautiful Soup: Mastering the Art of Parsing All Tables

1. Introduction

In the field of web crawling and data scraping, Beautiful Soup is a powerful Python library that helps us parse the HTML structure of a web page and extract useful information.

Among these tasks, parsing table data from web pages is one of the most common and important.

This article takes a deep look at how Beautiful Soup parses various kinds of tables, along with the challenges you may run into while parsing and how to solve them.

2. Introduction to Beautiful Soup

Beautiful Soup is an HTML/XML parsing library for Python, originally written by Leonard Richardson.

It converts a complex HTML document into a tree structure in which every node is a Python object.

This lets us traverse the tree in a simple way and extract the information we want.

3. Parsing Basic Tables

Let's start with the most basic kind of table.

On a web page, tables are usually represented with the HTML <table> tag.

Parsing a basic table with Beautiful Soup is straightforward: use the find() or find_all() method to locate the <table> tag, then iterate over the <tr> and <td> tags inside it.

With this approach, we can easily retrieve the data in a table and process it further.
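As a minimal sketch of this approach (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")              # locate the first <table>
for row in table.find_all("tr"):        # iterate over its rows
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    print(cells)
# ['Name', 'Age']
# ['Alice', '30']
# ['Bob', '25']
```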

4. Parsing Nested Tables

In reality, however, tables are often not a simple single-layer structure, but nested and complex.

In that case, we need a deeper understanding of Beautiful Soup's recursive search and traversal methods.

We can write a recursive function to handle nested tables, making sure we do not miss data at any level.
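One possible shape for such a recursive function is sketched below; the helper name walk_table is ours, and the snippet assumes the html.parser backend, which does not insert implicit <tbody> tags:

```python
from bs4 import BeautifulSoup

def walk_table(table, depth=0):
    """Recursively print cell text, descending into tables nested inside cells."""
    for row in table.find_all("tr", recursive=False):
        for cell in row.find_all(["td", "th"], recursive=False):
            inner = cell.find("table")
            if inner is not None:          # a nested table: recurse one level deeper
                walk_table(inner, depth + 1)
            else:
                print("  " * depth + cell.get_text(strip=True))

html = """
<table>
  <tr>
    <td>outer cell</td>
    <td><table><tr><td>inner cell</td></tr></table></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
walk_table(soup.find("table"))
# outer cell
#   inner cell
```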

5. Parsing Tables with Merged Cells

Sometimes a table on a web page contains merged cells, which makes parsing somewhat harder.

In this case, we can rely on the attributes Beautiful Soup exposes on each cell, such as rowspan and colspan, to detect and handle merged cells.
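One way to handle spans, sketched below, is to expand every rowspan/colspan into a rectangular grid. The helper table_to_grid is our own illustration, not a Beautiful Soup API:

```python
from bs4 import BeautifulSoup

def table_to_grid(table):
    """Expand rowspan/colspan so the table becomes a rectangular grid of cell text."""
    grid = []
    carry = {}  # (row, col) -> text propagated into that slot by an earlier span
    for r, row in enumerate(table.find_all("tr")):
        out, c = [], 0
        for cell in row.find_all(["td", "th"]):
            while (r, c) in carry:          # fill slots covered by earlier spans
                out.append(carry.pop((r, c)))
                c += 1
            text = cell.get_text(strip=True)
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            for dr in range(rowspan):       # register every extra slot this cell covers
                for dc in range(colspan):
                    if dr or dc:
                        carry[(r + dr, c + dc)] = text
            out.append(text)
            c += 1
        while (r, c) in carry:              # spanned slots at the end of the row
            out.append(carry.pop((r, c)))
            c += 1
        grid.append(out)
    return grid

html = """
<table>
  <tr><td rowspan="2">A</td><td>B</td><td>C</td></tr>
  <tr><td colspan="2">D</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
print(table_to_grid(soup.find("table")))
# [['A', 'B', 'C'], ['A', 'D', 'D']]
```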

6. Parsing Dynamically Loaded Tables

With the development of web technology, more and more pages present their data through dynamic loading.

Beautiful Soup Scraping Usage

BeautifulSoup is a powerful Python library for parsing and extracting data from HTML and XML files.

It gives developers a simple, flexible, and elegant way to work with web content, whether for web data analysis, web crawling, or extracting page content.

Below is a step-by-step walkthrough of how to use Beautiful Soup.

Step 1: Install Beautiful Soup. First, make sure you have Python installed.

Then, install Beautiful Soup from the command line with:

```
pip install beautifulsoup4
```

Once the installation finishes, we can start using Beautiful Soup.

Step 2: Import Beautiful Soup. Before using Beautiful Soup, we need to import it.

The library can be imported with the following code:

```python
from bs4 import BeautifulSoup
```

Step 3: Fetch the page content. Using a library such as urllib or requests, we can fetch the content of a web page.

For example, use the get method of the requests library to fetch the page content:

```python
import requests

res = requests.get('...')  # the target URL (omitted in the original)
html_content = res.text
```

Step 4: Parse the HTML. We need to pass the fetched HTML content to Beautiful Soup so that it can parse it.

A Beautiful Soup object can be created with the following code:

```python
soup = BeautifulSoup(html_content, 'html.parser')
```

Here, 'html.parser' is the parser argument, which tells Beautiful Soup which parser to use.

Step 5: Extract elements from the HTML. Now that the page content has been parsed into a Beautiful Soup object, we can use its methods and attributes to extract the elements we want.

For example, to extract all the links on a page, use the find_all method:

```python
links = soup.find_all('a')
for link in links:
    print(link['href'])
```

To extract only the content of a specific tag, use find or find_all with the tag name as an argument:

```python
title = soup.find('h1')
print(title.text)
```

Elements can also be selected by class name, id, attributes, and other features:

```python
# Select elements by class name
paragraphs = soup.find_all(class_='paragraph')
for p in paragraphs:
    print(p.text)

# Select an element by id
content = soup.find(id='content')
print(content.text)

# Select elements by attribute
images = soup.find_all('img', src='image.png')
for img in images:
    print(img['alt'])
```

Step 6: Process the extracted data. Once we have extracted the data we need, we can process and analyze it in many ways.
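For instance, continuing the example above, one simple form of processing is saving the extracted links to a CSV file (the filename links.csv is chosen purely for illustration):

```python
import csv

# Write every extracted link into a CSV file, one href per row.
with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['href'])               # header row
    for link in soup.find_all('a'):
        writer.writerow([link.get('href')])
```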

BeautifulSoup Python Library Manual

Table of Contents

About
Chapter 1: Getting started with beautifulsoup
  Remarks
  Versions
  Examples
    Installation or Setup
    A BeautifulSoup "Hello World" scraping example
Chapter 2: Locating elements
  Examples
    Locate a text after an element in BeautifulSoup
    Using CSS selectors to locate elements in BeautifulSoup
    Locating comments
    Filter functions
    Basic usage
    Providing additional arguments to filter functions
    Accessing internal tags and their attributes of initially selected tag
    Collecting optional elements and/or their attributes from series of pages
Credits

About

You can share this PDF with anyone you feel could benefit from it; download the latest version from: beautifulsoup

It is an unofficial and free beautifulsoup ebook created for educational purposes. All the content is extracted from Stack Overflow Documentation, which is written by many hardworking individuals at Stack Overflow. It is neither affiliated with Stack Overflow nor official beautifulsoup. The content is released under Creative Commons BY-SA, and the list of contributors to each chapter is provided in the credits section at the end of this book. Images may be copyright of their respective owners unless otherwise specified. All trademarks and registered trademarks are the property of their respective company owners.

Use the content presented in this book at your own risk; it is not guaranteed to be correct nor accurate. Please send your feedback and corrections to ********************

Chapter 1: Getting started with beautifulsoup

Remarks

In this section, we discuss what Beautiful Soup is, what it is used for, and a brief outline of how to go about using it.

Beautiful Soup is a Python library that uses your pre-installed html/xml parser and converts the web page/html/xml into a tree consisting of tags, elements, attributes and values. To be more exact, the tree consists of four types of objects: Tag, NavigableString, BeautifulSoup and Comment. This tree can then be "queried" using the methods/properties of the BeautifulSoup object that is created from the parser library.

Your need: Often, you may have one of the following needs:

1. You might want to parse a web page to determine how many of what tags are found, how many elements of each tag are found, and their values. You might want to change them.
2. You might want to determine element names and values, so that you can use them in conjunction with other libraries for web page automation, such as Selenium.
3. You might want to transfer/extract data shown in a web page to other formats, such as a CSV file or to a relational database such as SQLite or MySQL. In this case, the library helps you with the first step, of understanding the structure of the web page, although you will be using other libraries to do the act of transfer.
4. You might want to find out how many elements are styled with a certain CSS style, and which ones.

Sequence for typical basic use in your Python code:

1. Import the Beautiful Soup library.
2. Open a web page or html-text with the BeautifulSoup library, mentioning which parser is to be used. The result of this step is a BeautifulSoup object. (Note: the parser name mentioned must already be installed as part of your Python packages. For instance, html.parser is an in-built, 'with-batteries' package shipped with Python. You could install other parsers such as lxml or html5lib.)
3. "Query" or search the BeautifulSoup object using the syntax 'object.method' and obtain the result into a collection, such as a Python dictionary. For some methods, the output will be a simple value.
4. Use the result from the previous step to do whatever you want to do with it, in the rest of your Python code. You can also modify the element values or attribute values in the tree object. Modifications don't affect the source of the html code, but you can call output formatting methods (such as prettify) to create new output from the BeautifulSoup object.

Commonly used methods: Typically, the .find and .find_all methods are used to search the tree, given the input arguments. The input arguments are: the tag name being sought, attribute names, and other related arguments. These arguments can be presented as a string, a regular expression, a list, or even a function.

Common uses of the BeautifulSoup object include:

1. Search by CSS class
2. Search by hyperlink address
3. Search by element id, tag
4. Search by attribute name, attribute value

If you have a need to filter the tree with a combination of the above criteria, you could also write a function that evaluates to true or false, and search by that function.

Versions

Examples

Installation or Setup

pip may be used to install BeautifulSoup. To install version 4 of BeautifulSoup, run the command:

```
pip install beautifulsoup4
```

Be aware that the package name is beautifulsoup4 instead of beautifulsoup; the latter name stands for the old release, see old beautifulsoup.

A BeautifulSoup "Hello World" scraping example

```python
from bs4 import BeautifulSoup
import requests

# The domain was stripped in the source; en.wikipedia.org is assumed here.
main_url = "https://en.wikipedia.org/wiki/Hello_world"
req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")

# Finding the main title tag.
title = soup.find("h1", class_="firstHeading")
print(title.get_text())

# Finding the mid-titles tags and storing them in a list.
mid_titles = [tag.get_text() for tag in soup.find_all("span", class_="mw-headline")]

# Now using css selectors to retrieve the article shortcut links
links_tags = soup.select("li.toclevel-1")
for tag in links_tags:
    print(tag.a.get("href"))

# Retrieving the side page links by "blocks" and storing them in a dictionary
side_page_blocks = soup.find("div", id="mw-panel").find_all("div", class_="portal")
blocks_links = {}
for num, block in enumerate(side_page_blocks):
    blocks_links[num] = [link.get("href") for link in block.find_all("a", href=True)]
print(blocks_links[0])
```

Output:

```
"Hello, World!" program
#Purpose
#History
#Variations
#See_also
#References
#External_links
['/wiki/Main_Page', '/wiki/Portal:Contents', '/wiki/Portal:Featured_content',
 '/wiki/Portal:Current_events', '/wiki/Special:Random',
 'https://.../wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&...',
 '//...']
```

Entering your preferred parser when instantiating Beautiful Soup avoids the usual warning declaring that no parser was explicitly specified.

Different methods can be used to find an element within the webpage tree. Although a handful of other methods exist, CSS classes and CSS selectors are two handy ways to find elements in the tree.

It should be noted that we can look for tags by setting their attribute value to True when searching for them.

get_text() allows us to retrieve the text contained within a tag. It returns it as a single Unicode string. tag.get("attribute") allows us to get a tag's attribute value.

Read Getting started with beautifulsoup online: https:///beautifulsoup/topic/1817/getting-started-with-beautifulsoup

Chapter 2: Locating elements

Examples

Locate a text after an element in BeautifulSoup

Imagine you have the following HTML:

```html
<div>
    <label>Name:</label>
    John Smith
</div>
```

And you need to locate the text "John Smith" after the label element.

In this case, you can locate the label element by text and then use the .next_sibling property:

```python
from bs4 import BeautifulSoup

data = """
<div>
    <label>Name:</label>
    John Smith
</div>
"""

soup = BeautifulSoup(data, "html.parser")

label = soup.find("label", text="Name:")
print(label.next_sibling.strip())
```

Prints John Smith.

Using CSS selectors to locate elements in BeautifulSoup

BeautifulSoup has a limited support for CSS selectors, but covers most commonly used ones. Use the select() method to find multiple elements and select_one() to find a single element.

Basic example:

```python
from bs4 import BeautifulSoup

data = """
<ul>
    <li class="item">item1</li>
    <li class="item">item2</li>
    <li class="item">item3</li>
</ul>
"""

soup = BeautifulSoup(data, "html.parser")
for item in soup.select("li.item"):
    print(item.get_text())
```

Prints:

```
item1
item2
item3
```

Locating comments

To locate comments in BeautifulSoup, use the text (or string in the recent versions) argument, checking the type to be Comment:

```python
from bs4 import BeautifulSoup
from bs4 import Comment

data = """
<html>
    <body>
        <div>
            <!-- desired text -->
        </div>
    </body>
</html>
"""

soup = BeautifulSoup(data, "html.parser")
comment = soup.find(text=lambda text: isinstance(text, Comment))
print(comment)
```

Prints desired text.

Filter functions

BeautifulSoup allows you to filter results by providing a function to find_all and similar functions. This can be useful for complex filters as well as a tool for code reuse.

Basic usage

Define a function that takes an element as its only argument. The function should return True if the argument matches.

```python
def has_href(tag):
    '''Returns True for tags with a href attribute'''
    return bool(tag.get("href"))

soup.find_all(has_href)  # find all elements with a href attribute

# equivalent using lambda:
soup.find_all(lambda tag: bool(tag.get("href")))
```

Providing additional arguments to filter functions

Since the function passed to find_all can only take one argument, it's sometimes useful to make 'function factories' that produce functions fit for use in find_all. This is useful for making your tag-finding functions more flexible.

```python
def present_in_href(check_string):
    return lambda tag: tag.get("href") and check_string in tag.get("href")

soup.find_all(present_in_href("/partial/path"))
```

Accessing internal tags and their attributes of initially selected tag

Let's assume you got an html after selecting with soup.find('div', class_='base class'):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(SomePage, 'lxml')
html = soup.find('div', class_='base class')
print(html)
# <div class="base class">
#     <div>Sample text 1</div>
#     <div>Sample text 2</div>
#     <div>
#         <a class="ordinary link" href="https://...">URL text</a>
#     </div>
# </div>
# <div class="Confusing class"></div>
```

And if you want to access the <a> tag's href, you can do it this way:

```python
a_tag = html.a
link = a_tag['href']
print(link)
# https://...
```

This is useful when you can't directly select the <a> tag because its attrs don't give you unique identification and there are other "twin" <a> tags in the parsed page. But you can uniquely select a parent tag which contains the needed <a>.

Collecting optional elements and/or their attributes from series of pages

Let's consider the situation when you parse a number of pages and you want to collect a value from an element that's optional (can be present on one page and absent on another) for a particular page.

Moreover, the element itself is, for example, the most ordinary element on the page; in other words, no specific attributes can uniquely locate it. But you see that you can properly select its parent element, and you know the wanted element's order number in the respective nesting level.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(SomePage, 'lxml')
html = soup.find('div', class_='base class')  # below, html_1 and html_2 stand in for it
```

The wanted element is optional, so there could be 2 situations for html to be:

```python
html_1 = '''
<div class="base class">       # №0
    <div>Sample text 1</div>   # №1
    <div>Sample text 2</div>   # №2
    <div>!Needed text!</div>   # №3
</div>
<div>Confusing div text</div>  # №4
'''

html_2 = '''
<div class="base class">       # №0
    <div>Sample text 1</div>   # №1
    <div>Sample text 2</div>   # №2
</div>
<div>Confusing div text</div>  # №4
'''
```

If you got html_1, you can collect !Needed text! from tag №3 this way (parsing the markup first):

```python
soup_1 = BeautifulSoup(html_1, 'lxml')
wanted_tag = soup_1.div.div.find_next_sibling().find_next_sibling()  # whole tag №3
```

It initially gets the №1 div, then switches twice to the next div on the same nesting level to get to №3.

```python
wanted_text = wanted_tag.text  # extracting !Needed text!
```

The usefulness of this approach comes when you get html_2: the approach won't raise an error, it will give None:

```python
soup_2 = BeautifulSoup(html_2, 'lxml')
print(soup_2.div.div.find_next_sibling().find_next_sibling())
# None
```

Using find_next_sibling() here is crucial because it limits the element search to the respective nesting level. If you used find_next(), then tag №4 would be collected, and you don't want it:

```python
print(soup_2.div.div.find_next().find_next())
# <div>Confusing div text</div>
```

You can also explore find_previous_sibling() and find_previous(), which work in exactly the opposite way.

All described functions have their multiple variants to catch all tags, not just the first one:

find_next_siblings()
find_previous_siblings()
find_all_next()
find_all_previous()

Read Locating elements online: https:///beautifulsoup/topic/1940/locating-elements

Using Beautiful Soup

BeautifulSoup is a Python library for parsing HTML and XML.

It provides a very simple way to traverse and search these document trees, allowing you to quickly find the information you need and extract it.

In this article, we explore the basic methods and techniques of parsing HTML and XML with BeautifulSoup.

Installing BeautifulSoup

Before you start using BeautifulSoup, you need to make sure it has been installed successfully.

There are several ways to install it, including Python's package management tool pip, or downloading the source code and installing it manually.

In this article, we cover installing BeautifulSoup with pip.

In a terminal or at the command line, enter the following command to install BeautifulSoup:

```
pip install beautifulsoup4
```

Once the installation completes, you can use BeautifulSoup to parse HTML and XML files.

Opening an HTML file

Opening an HTML file with BeautifulSoup is very simple.

You only need Python's open() function and BeautifulSoup's constructor.

Here is an example:

```python
from bs4 import BeautifulSoup

with open('example.html') as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')
```

In this example, we use an HTML file named example.html and parse it into a tree structure with the BeautifulSoup constructor.

After parsing, we can use the BeautifulSoup object soup to traverse and search the HTML document.
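For instance, a quick search over the parsed tree might look like this (assuming example.html happens to contain <h1> and <a> tags):

```python
# Hypothetical usage; the tag names depend on what example.html contains.
heading = soup.find('h1')           # first <h1>, or None if the page has none
if heading is not None:
    print(heading.get_text())

for link in soup.find_all('a'):     # every <a> element in the document
    print(link.get('href'))
```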

Parsing HTML text

If you have HTML text rather than an HTML file, you can also use the BeautifulSoup constructor to parse it into a tree structure.

Here is an example:

```python
from bs4 import BeautifulSoup

html_text = '<html><body><h1>Example HTML</h1><p>This is an example of an HTML document</p></body></html>'
soup = BeautifulSoup(html_text, 'html.parser')
```

In this example, we define an HTML text string html_text and parse it into a tree structure with the BeautifulSoup constructor.

Python Crawler Data Parsing with BeautifulSoup

BeautifulSoup is a Python library that extracts data from HTML or XML files.

Through your preferred parser, it provides idiomatic ways of navigating, searching, and modifying a document.

BeautifulSoup is one of the three major parsing approaches used in Python web crawling.

Let's start with an example:

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="/elsie" class="sister" id="link1">Elsie</a>,
<a href="/lacie" class="sister" id="link2">Lacie</a> and
<a href="/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())
```

This BeautifulSoup object can print the document in a standard indented structure.

How to Use Beautiful Soup

Beautiful Soup is a Python library used for web scraping, which means extracting data from websites. It provides a convenient way to navigate, search, and modify the parse tree of HTML and XML documents.

Using Beautiful Soup, you can easily extract information from a webpage by specifying the tags and attributes you want to target. It is often used in combination with other libraries like Requests to make HTTP requests and retrieve the HTML content of a webpage.
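A minimal sketch of that combination (the URL below is a placeholder; any reachable page works the same way):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder target URL, used only for illustration.
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Target elements by tag name; attributes work the same way via keyword arguments.
for heading in soup.find_all("h1"):
    print(heading.get_text(strip=True))
```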

One of the key features of Beautiful Soup is its ability to convert incoming documents to Unicode and outgoing documents to UTF-8. This makes it easier to work with text in different languages and character encodings.
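A small illustration of this behavior (the byte string below is invented for the example, and encoding detection can be approximate for short inputs):

```python
from bs4 import BeautifulSoup

# A Latin-1 encoded byte string; Beautiful Soup decodes incoming bytes to Unicode.
markup = "<p>caf\xe9</p>".encode("latin-1")
soup = BeautifulSoup(markup, "html.parser")

print(soup.p.string)           # 'café', now a Unicode string
print(soup.original_encoding)  # the encoding Beautiful Soup detected (a guess here)
print(soup.encode())           # outgoing documents are UTF-8 bytes by default
```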

Beautiful Soup 4 Documentation in Python

(P.S. For getting started, the official documentation is really the best resource; this post just records some simple usage.)

First, the methods most commonly used in real work. The example HTML (taken from the official examples):

```html
<html>
<head>
<title>Page title</title>
</head>
<body>
<p id="firstpara" align="center">
This is paragraph<b>one</b>.
</p>
<p id="secondpara" align="blah">
This is paragraph<b>two</b>.
</p>
</body>
</html>
```

0. Initialization:

```python
soup = BeautifulSoup(html)  # html is the HTML source code as a string, type(html) == str
```

1. Use a tag to get the corresponding block of the parse tree: since we are analyzing HTML, we first need to find the tag blocks that are useful to us, and Beautiful Soup provides a very convenient way to do that.

When we search by tag, we get the parse subtree containing that tag block (<tag><xxx>ooo</xxx></tag>). Here we fetch the head block:

```python
head = soup.find('head')
# or
# head = soup.head
# or
# head = soup.contents[0].contents[0]
```

After running this, we get:

```html
<head>
<title>Page title</title>
</head>
```

The find method searches the current tag's parse tree (the current HTML block) for a subtree that matches the condition and returns it.

Beautiful Soup Parsing

BeautifulSoup is a popular Python library for parsing HTML and XML documents.

It makes it easy to locate the structure and content of HTML, and it provides a set of methods and attributes for processing and extracting data.

In this article, we take a close look at how BeautifulSoup works, including its basic syntax, common methods, and extended functionality.

Let's start with the basics of BeautifulSoup.

BeautifulSoup is an HTML parsing library; rather than implementing a parsing algorithm of its own, it delegates the actual parsing to an underlying parser such as Python's built-in html.parser or lxml.

The parser walks the HTML document, finds all the tags (also called "elements"), and Beautiful Soup converts them into Python objects.

These objects carry each tag's metadata, such as its class, id, name, attributes, and links.

Next, let's take a closer look at BeautifulSoup's commonly used methods and attributes.

These methods include:

1. The find() method: finds the first tag or element in the HTML document that matches the given criteria. find() returns a single Tag object, or None when nothing matches.

2. The select() method: selects tags or elements in the HTML document using CSS selectors. select() returns a list containing every tag in the document that matches the selector.

3. The extract() and decompose() methods: these remove a specified tag or element from the HTML document. extract() detaches the tag from the tree and returns it, while decompose() destroys it in place.

4. Modifying the document: content and structure are updated through ordinary assignment and methods such as replace_with(); the tree then reflects the updated document.

Beyond the common methods and attributes, BeautifulSoup provides many more features, for example:

1. The find_all() method: finds all the specified tags or elements in the HTML document. find_all() returns a ResultSet, a list containing every matching tag or element. A short sketch of these methods in action follows below.
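A short sketch of these methods in action (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

html = '<div id="content"><p class="intro">Hello</p><p>World</p></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('p'))           # first match only: <p class="intro">Hello</p>
print(soup.select('p.intro'))   # CSS selector; returns a list of matching tags
print(soup.find_all('p'))       # every <p>, as a ResultSet (a list subclass)

intro = soup.find('p', class_='intro')
intro.extract()                 # detach the tag from the tree (decompose() destroys it)
print(soup.find(id='content'))  # <div id="content"><p>World</p></div>
```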


First, parse the document (html_doc here is the "Dormouse" example shown earlier) and pretty-print it:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
```

A few simple ways to browse the structured data:

```python
soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="/tillie" id="link3">Tillie</a>
```

Finding all the links that appear in the document's <a> tags:

```python
for link in soup.find_all('a'):
    print(link.get('href'))
# /elsie
# /lacie
# /tillie
```

Getting all the text content of the document:

```python
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
```

Is this what you want? Don't worry, there is something even handier.

Installing Beautiful Soup

If you are using a recent Debian or Ubuntu, you can install it through the system package manager:

```
$ apt-get install python-bs4
```

Beautiful Soup 4 is published through PyPi, so if you cannot install it with the system package manager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3:

```
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
```

(PyPi also has a package named BeautifulSoup, but that may not be what you want: it is the Beautiful Soup 3 release. Many projects still use BS3, so the BeautifulSoup package is still available, but if you are writing a new project, you should install beautifulsoup4.)

If you have neither easy_install nor pip installed, you can download the BS4 source code and install it with setup.py:

```
$ python setup.py install
```

If all of the above installation methods fail, Beautiful Soup's release license allows you to package the whole BS4 source code with your project, so it can be used without any installation at all.

The author develops Beautiful Soup on Python 2.7 and Python 3.2; in theory Beautiful Soup should work correctly on all current Python versions.

Problems after installation

Beautiful Soup is packaged as Python 2 code. When it is installed under Python 3, it is automatically converted to Python 3 code; without an installation step, the code is never converted.

If the code throws an ImportError "No module named HTMLParser", you are running Python 2 code under Python 3. If the code throws an ImportError "No module named html.parser", you are running Python 3 code under Python 2. In both cases, the best fix is to reinstall BeautifulSoup4.

If you get a SyntaxError "Invalid syntax" at the line ROOT_TAG_NAME = u'[document]', you need to convert the BS4 code from Python 2 to Python 3. You can do so by reinstalling BS4:

```
$ python3 setup.py install
```

or by running Python's 2to3 conversion script on the bs4 directory:

```
$ 2to3-3.2 -w bs4
```

Installing a parser

Beautiful Soup supports the HTML parser in Python's standard library, and it also supports several third-party parsers. One of them is lxml. Depending on your operating system, lxml can be installed with one of the following:

```
$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml
```

Another alternative is html5lib, a pure-Python parser that parses HTML the same way a web browser does. It can be installed with one of the following:

```
$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib
```

The main parsers, with their advantages and disadvantages:

- Python standard library: BeautifulSoup(markup, "html.parser"). Advantages: Python's built-in standard library, moderate speed, tolerant of malformed documents. Disadvantage: poor tolerance of malformed documents in Python versions before 2.7.3 and 3.2.2.
- lxml HTML parser: BeautifulSoup(markup, "lxml"). Advantages: fast, tolerant of malformed documents. Disadvantage: requires the C library.
- lxml XML parser: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml"). Advantages: fast, and the only parser that supports XML. Disadvantage: requires the C library.
- html5lib: BeautifulSoup(markup, "html5lib"). Advantages: the best error tolerance, parses documents the way a browser does, and generates HTML5-format documents. Disadvantages: slow, and an external Python dependency.

Using lxml as the parser is recommended, because it is more efficient. For Python versions before 2.7.3, and Python 3 versions before 3.2.2, installing lxml or html5lib is required, because the HTML parsing built into the standard library of those Python versions is not stable enough.

Tip: if an HTML or XML document is malformed, different parsers may return different results; see "Differences between parsers" for more details.

How to use

Pass a document into the BeautifulSoup constructor to get a document object; you can pass a string or an open file handle:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
```

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

```python
BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>
```

Then Beautiful Soup picks the most suitable parser to parse the document; if you specify a parser manually, Beautiful Soup will use the specified parser instead (see "Parsing XML").

Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to a tag in the original XML or HTML document:

```python
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
```

Tags have many methods and attributes, explained in detail in "Navigating the tree" and "Searching the tree". For now, the most important attributes of a tag are its name and attributes.

Name

Every tag has its own name, accessible as .name:

```python
tag.name
# u'b'
```

If you change a tag's name, the change is reflected in all HTML generated from the current Beautiful Soup object:

```python
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
```

Attributes

A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". Tag attributes are operated on the same way as a dictionary:

```python
tag['class']
# u'boldest'
```

You can also access the attributes directly, via .attrs:

```python
tag.attrs
# {u'class': u'boldest'}
```

Tag attributes can be added, removed, and modified. Once more, this works just like a dictionary:

```python
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
```

Multi-valued attributes

HTML 4 defines a series of attributes that can contain multiple values; HTML 5 removes some of them but adds more. The most common multi-valued attribute is class (a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns a multi-valued attribute as a list:

```python
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]
```

If an attribute looks like it has more than one value, but it is not defined as a multi-valued attribute in any version of the HTML standard, Beautiful Soup returns the attribute as a plain string:

```python
id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'
```

When a tag is converted back into a string, the values of a multi-valued attribute are joined into one:

```python
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>
```

If the document is parsed as XML, there are no multi-valued attributes:

```python
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# u'body strikeout'
```

Navigable strings

Strings are usually contained inside tags. Beautiful Soup uses the NavigableString class to wrap a tag's strings:

```python
tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
```

A NavigableString is just like a Python Unicode string, and it also supports some of the features described in "Navigating the tree" and "Searching the tree". A NavigableString can be converted directly into a Unicode string with the unicode() method:

```python
unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>
```

The string contained in a tag cannot be edited in place, but it can be replaced with another string, using the replace_with() method:

```python
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>
```

NavigableString objects support most of the attributes defined in "Navigating the tree" and "Searching the tree", but not all of them. In particular, since a string cannot contain anything (a tag can contain a string or another tag), strings do not support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, call unicode() on it to turn it into a normal Unicode string. Otherwise, even after Beautiful Soup has finished running, the string will keep a reference to the whole parse tree, which wastes memory.

BeautifulSoup

The BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object; it supports most of the methods described in "Navigating the tree" and "Searching the tree".

Because the BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes. But it is sometimes convenient to look at its .name, so the BeautifulSoup object carries a special .name value of "[document]":

```python
soup.name
# u'[document]'
```

Comments and special strings

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but there are a few leftover pieces. The one most likely to cause concern is the comment:

```python
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
```

The Comment object is a special type of NavigableString:

```python
comment
# u'Hey, buddy. Want to buy a used parser'
```

But when it appears in an HTML document, the Comment object is output with special formatting:

```python
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
```

Beautiful Soup defines other types that may appear in XML documents: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, these classes are subclasses of NavigableString with some extra methods added to the string. Here is an example that replaces the comment with a CDATA block:

```python
from bs4 import CData

cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
```

Navigating the tree

Let's take the "Alice in Wonderland" document as the example again:

```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="/elsie" class="sister" id="link1">Elsie</a>,
<a href="/lacie" class="sister" id="link2">Lacie</a> and
<a href="/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
```

This example demonstrates how to move from one part of a document to another.

Child nodes

A Tag may contain multiple strings or other Tags; these are the Tag's child nodes. Beautiful Soup provides many attributes for operating on and traversing child nodes.

Note: Beautiful Soup strings do not support these attributes, because a string has no children.

Navigating using tag names

The simplest way to operate on the document tree is to say the name of the tag you want. If you want the <head> tag, just use soup.head:

```python
soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>
```

This little trick can be used repeatedly on the tags in the document tree. The following code gets the first <b> tag inside the <body> tag:

```python
soup.body.b
# <b>The Dormouse's story</b>
```

Getting a tag by name as an attribute only gives you the first tag with that name:

```python
soup.a
# <a class="sister" href="/elsie" id="link1">Elsie</a>
```

If you want all the <a> tags, or anything more than the first tag with a given name, you need the methods described in "Searching the tree", such as find_all():

```python
soup.find_all('a')
# [<a class="sister" href="/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="/tillie" id="link3">Tillie</a>]
```

.contents and .children

A tag's child nodes are available as a list through the .contents attribute:

```python
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
```

The BeautifulSoup object itself always has child nodes; in other words, the <html> tag is also a child node of the BeautifulSoup object.
