Python Web Parsing, Intermediate: A Deeper Look at the BeautifulSoup Library

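BeautifulSoup is one of the main page-parsing tools used in Python web scrapers. The beginner tutorial covered its basic usage. This article moves on to more advanced features: richer search conditions, walking the DOM tree, modifying the tree, and parsing XML.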

1. Complex Search Conditions

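When looking up elements with the find and find_all methods, we can combine a tag name with further conditions. For example, to find every p tag whose class is "story" (the keyword is spelled class_ because class is a reserved word in Python):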

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

story_p_tags = soup.find_all('p', class_='story')

for p in story_p_tags:
    print(p.string)
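
The example above filters on a single class. As a rough sketch of what more complex conditions can look like, the snippet below combines several attributes, matches an attribute against a regular expression, filters with a function, and uses a CSS selector. The extra id attribute, the a tags, the example.com URLs, and the helper name has_class_but_no_id are assumptions added for illustration and are not part of the original document.

```python
import re
from bs4 import BeautifulSoup

# Sample markup; the id, the <a> tags and the URLs are illustrative assumptions.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story" id="intro">Once upon a time there were three little sisters.</p>
<p class="story">They lived at the bottom of a well.</p>
<a href="http://example.com/elsie" class="sister">Elsie</a>
<a href="http://example.com/lacie" class="sister">Lacie</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Combine a tag name with several attribute conditions.
intro = soup.find_all('p', attrs={'class': 'story', 'id': 'intro'})

# Match an attribute value against a regular expression.
elsie_links = soup.find_all('a', href=re.compile(r'elsie$'))

# Use a function as the filter: keep tags that have a class but no id.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

no_id_tags = soup.find_all(has_class_but_no_id)

# The same kind of query expressed as a CSS selector.
story_via_css = soup.select('p.story')

print(len(intro), len(elsie_links), len(no_id_tags), len(story_via_css))
```

Which form to use is mostly a matter of taste: keyword arguments are convenient for simple cases, while a function filter can express conditions that the other forms cannot.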

2. Traversing the DOM Tree

BeautifulSoup makes it easy to walk the DOM tree. The most commonly used traversal attributes are shown below:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Iterate over the direct children of <body>
for child in soup.body.children:
    print(child)

# Iterate over every descendant of <body>, at any depth
for descendant in soup.body.descendants:
    print(descendant)

# Iterate over the siblings that follow the first <p>
for sibling in soup.p.next_siblings:
    print(sibling)

# Get the parent of the first <p>
print(soup.p.parent)
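
One detail worth knowing when running the loops above: children, descendants and next_siblings yield text nodes (including the whitespace between tags) as well as tags. The following is a minimal sketch of how one might keep only the tags, jump straight to the next matching sibling, and walk upward through the ancestors. The variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Keep only Tag objects, dropping the whitespace text nodes between them.
child_tags = [child for child in soup.body.children if isinstance(child, Tag)]
child_names = [tag.name for tag in child_tags]

# find_next_sibling skips text nodes and returns the next matching tag.
next_p = soup.p.find_next_sibling('p')

# .parents walks from the immediate parent up towards the document root.
ancestor_names = [parent.name for parent in soup.b.parents]

print(child_names)
print(next_p['class'])
print(ancestor_names)
```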

3. Modifying the DOM Tree

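Besides traversing the DOM tree, we can also modify it. For example, we can change a tag's content and its attributes. Note that assigning to .string replaces everything inside the tag, including any child tags: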

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

soup.p.string = 'New story'
soup.p['class'] = 'new_title'

print(soup.p)
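
Changing text and attributes is only part of what is possible. As a small sketch, the snippet below also creates a new tag, appends it to an existing one, deletes an attribute, and removes a tag from the tree. The shortened markup and the example.com URL are placeholders chosen for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, assumed for this sketch.
html_doc = """
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
</body>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Create a new <a> tag and append it to the story paragraph.
link = soup.new_tag('a', href='http://example.com/more')  # placeholder URL
link.string = 'Read more'
story = soup.find('p', class_='story')
story.append(link)

# Delete an attribute from a tag.
title = soup.find('p', class_='title')
del title['class']

# Remove a tag (and its contents) from the tree entirely.
title.b.decompose()

print(soup)
```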

4. Parsing XML

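BeautifulSoup can parse XML as well as HTML. We only need to pass "lxml-xml" as the parser when creating the BeautifulSoup object. This parser is provided by the third-party lxml package, which must be installed separately: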

from bs4 import BeautifulSoup

xml_doc = """
<bookstore>
<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
</book>
</bookstore>
"""

soup = BeautifulSoup(xml_doc, 'lxml-xml')

print(soup.prettify())
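
Once the XML is parsed, the same search and navigation methods used for HTML apply. The following sketch, which assumes lxml is installed, collects each book's fields into a dictionary. The books list and its key names are illustrative choices, not something the library prescribes.

```python
from bs4 import BeautifulSoup

xml_doc = """
<bookstore>
<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
</book>
</bookstore>
"""

# Assumes the lxml package is available.
soup = BeautifulSoup(xml_doc, 'lxml-xml')

books = []
for book in soup.find_all('book'):
    books.append({
        'category': book['category'],        # attribute on <book>
        'title': book.title.get_text(),      # text of the <title> child
        'lang': book.title['lang'],          # attribute on <title>
        'author': book.author.get_text(),
        'year': int(book.year.get_text()),   # convert the text to an integer
    })

print(books)
```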

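That covers the more advanced uses of BeautifulSoup: combining search conditions, traversing and modifying the DOM tree, and parsing XML. With these techniques, page parsing in a web scraper becomes more precise and easier to maintain.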
