AI Programming Advanced: Advanced Python

Chapter 9: Web Crawler Modules

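Preface

This chapter covers the modules used to fetch data over the network and to parse web pages. Knowing them makes it easier to pull useful information out of the pages you care about. The main modules are:

urllib (standard library)
requests (third-party, the most popular)
selenium (browser automation)
BeautifulSoup (HTML/XML parsing)
Scrapy (full crawler framework)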

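🌐 1. `urllib`: the HTTP client in the Python standard library

✅ Positioning

A built-in Python package that needs no installation. It suits lightweight HTTP requests and learning how things work at a lower level.

🔧 Submodules

urllib.request: open URLs (GET/POST)
urllib.parse: URL encoding and parsing
urllib.error: exception classes
urllib.robotparser: parse robots.txt

💡 Basic usage

1. GET request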

from urllib import request

url = "https://httpbin.org/get"
with request.urlopen(url) as resp:
    data = resp.read().decode('utf-8')
    print(data)

2. POST request (with parameters)

from urllib import request, parse

url = "https://httpbin.org/post"
data = parse.urlencode({'name': 'Alice', 'age': 30}).encode()
req = request.Request(url, data=data)
with request.urlopen(req) as resp:
    print(resp.read().decode())

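3. Adding request headers (imitating a browser)

from urllib import request

url = "https://httpbin.org/headers"  # echoes the request headers back
headers = {'User-Agent': 'Mozilla/5.0'}
req = request.Request(url, headers=headers)
with request.urlopen(req) as resp:  # the context manager closes the connection
    print(resp.read().decode('utf-8'))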

4. URL encoding / decoding

from urllib.parse import urlencode, urlparse, parse_qs

# Encode a dict as a query string
params = {'q': '中文', 'page': 1}
encoded = urlencode(params)  # q=%E4%B8%AD%E6%96%87&page=1

# Parse a URL
parsed = urlparse("https://example.com/path?k=v")
print(parsed.query)  # k=v
print(parse_qs(parsed.query))  # {'k': ['v']}
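
The heading above mentions decoding as well, so here is a minimal sketch of the reverse direction, plus resolving a relative link against a page URL. It uses only `urllib.parse`; the example.com URLs are placeholders, not taken from any real site.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urljoin

# Percent-encode a single component, then decode it again
encoded = quote("中文")
decoded = unquote(encoded)
print(encoded, decoded)

# Round trip a whole query string: dict -> string -> dict of lists
query = urlencode({"q": "中文", "page": 1})
params = parse_qs(query)
print(params)

# Resolve links found on a page against the URL of that page
page_url = "https://example.com/list/page1"   # placeholder URL
absolute_from_root = urljoin(page_url, "/page2")
absolute_relative = urljoin(page_url, "page2")
print(absolute_from_root, absolute_relative)
```

Note that `parse_qs` always returns lists of strings, so the number `1` comes back as `'1'`.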

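⚠️ Drawbacks

Verbose API (you encode data and build Request objects by hand)
No built-in Session object, and cookies are not managed automatically (you have to wire up http.cookiejar yourself)
Error handling is cumbersome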
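
To make the error-handling point concrete, here is a rough sketch of the two exception classes in `urllib.error`. Everything around the `try` block is scaffolding assumed for the demo: `DemoHandler`, `fetch`, and the throwaway local server exist only so the example runs without internet access.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local server so the example runs without internet access."""

    def do_GET(self):
        if self.path == "/ok":
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# An opener with no proxies, so the local demo server is reached directly.
opener = request.build_opener(request.ProxyHandler({}))


def fetch(url, timeout=5):
    """Return (status, text); status is None when no HTTP response arrived."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except error.HTTPError as e:
        # The server answered, but with an error status (4xx/5xx).
        # HTTPError subclasses URLError, so it has to be caught first.
        return e.code, str(e.reason)
    except error.URLError as e:
        # No usable response: DNS failure, refused connection, and so on.
        return None, str(e.reason)


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

ok = fetch(base + "/ok")
missing = fetch(base + "/missing")
print(ok)
print(missing)

server.shutdown()
server.server_close()
```

In a real script you would call `request.urlopen` directly. The order of the `except` clauses is the part worth copying.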

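✅ When to use it

Environments where third-party packages cannot be installed (for example, some servers)
Learning how HTTP works
Simple scripts (such as downloading a file)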

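🚀 2. `requests`: the most popular HTTP library

✅ Positioning

Billed as "HTTP for Humans": concise, capable, and backed by a large community. It is the default choice for most projects.

🔧 Installation

pip install requests

💡 Basic usage

1. GET / POST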

import requests

# GET
resp = requests.get("https://httpbin.org/get", params={'q': 'python'})
print(resp.status_code, resp.json())

# POST
resp = requests.post("https://httpbin.org/post", data={'name': 'Bob'})
print(resp.json())

2. Request headers and timeouts

headers = {'User-Agent': 'MyBot/1.0'}
resp = requests.get(url, headers=headers, timeout=5)

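3. Sessions (automatic cookie handling)

session = requests.Session()
session.headers.update({'User-Agent': 'MyApp'})

# After logging in, the session sends the cookies automatically
# (login_url, login_data and profile_url are placeholders)
session.post(login_url, data=login_data)
profile = session.get(profile_url)  # sent with the login cookies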

4. File upload / download

# Upload a file
with open('photo.jpg', 'rb') as f:
    requests.post(upload_url, files={'file': f})

# Download a large file (streaming)
with requests.get(file_url, stream=True) as r:
    r.raise_for_status()  # do not save an error page as the file
    with open('large.zip', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)

5. Exception handling

try:
    resp = requests.get(url, timeout=3)
    resp.raise_for_status()  # raises HTTPError for 4xx/5xx responses
except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as e:
    print("HTTP error:", e)

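✅ Advanced features

Proxies: proxies={'http': 'http://10.10.1.10:3128'}
SSL verification control: verify=False (not recommended in production)
Redirect control: allow_redirects=False
PreparedRequest: build a request ahead of time (useful for debugging and reuse)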
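
As a small illustration of the PreparedRequest item, the sketch below builds a request without sending it, so the final URL and headers can be inspected first. The URL, parameters, and User-Agent value are placeholders reused from the earlier snippets. The send step is left commented out because it needs network access.

```python
import requests

# Describe the request; nothing is sent yet
req = requests.Request(
    "GET",
    "https://httpbin.org/get",
    params={"q": "python", "page": 1},
    headers={"User-Agent": "MyBot/1.0"},
)

session = requests.Session()
prepared = session.prepare_request(req)  # merges session defaults, encodes params

# Inspect exactly what would go over the wire
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])

# Sending is a separate step, with the transport options from the list above:
# resp = session.send(prepared, timeout=5, allow_redirects=False)
```

Preparing through the session (rather than calling `req.prepare()`) keeps session-level headers and cookies in the result.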

✅ When to use it

Calling APIs (RESTful)
Simple scraping of static pages
Automated testing
Data collection scripts

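🕵️ 3. `selenium`: browser automation

✅ Positioning

Drives a real browser (Chrome/Firefox), so it can execute JavaScript and handle dynamically rendered pages (such as React/Vue single-page applications).

🔧 Installation

pip install selenium
# You also need a matching browser driver (such as chromedriver)
# webdriver-manager can download and manage it for you:
pip install webdriver-manager

💡 Basic usage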

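1. Launching the browser (driver managed automatically)

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 takes the driver path through a Service object
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
driver.get("https://example.com")

# Find an element
title = driver.find_element(By.TAG_NAME, "h1").text
print(title)

driver.quit()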

2. Waiting for elements to load (essential!)

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.ID, "dynamic-content")))

3. Simulating user actions

# Click
button = driver.find_element(By.ID, "submit")
button.click()

# Type text
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("Python web crawler")
search_box.submit()

# Run JavaScript
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

4. Getting the rendered HTML

html = driver.page_source  # the DOM after JavaScript has run

5. Headless mode (running in the background)

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # no visible window
driver = webdriver.Chrome(options=options)

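⚠️ Drawbacks

Slow (starting a browser is expensive)
Heavy on resources (each instance can take several hundred MB of memory)
Higher maintenance cost (browser and driver versions have to match)

✅ When to use it

Dynamic pages (content loaded by JavaScript)
Sites that require login, CAPTCHAs, or slider verification
Automated testing (UI tests)
Simulating real user behavior (clicking, scrolling)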

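🧼 4. `BeautifulSoup`: an HTML/XML parsing tool

✅ Positioning

It is not a networking library. It only parses HTML/XML documents and extracts structured data from them.

🔧 Installation

pip install beautifulsoup4
# The lxml parser is a recommended companion (faster and more lenient):
pip install lxml

💡 Basic usage

1. Creating the parsed object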

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div class="item">Item 1</div>
    <div class="item">Item 2</div>
    <a rel="nofollow" href="/page2">Next</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')  # or 'html.parser'

2. Finding elements

# find / find_all
divs = soup.find_all('div', class_='item')
for div in divs:
    print(div.text)  # Item 1, then Item 2

# CSS selectors (recommended)
items = soup.select('div.item')
link = soup.select_one('a[href]')['href']  # /page2

3. Extracting attributes and text

tag = soup.div
print(tag.get_text())      # Item 1
print(tag['class'])        # ['item']
print(tag.attrs)           # {'class': ['item']}

4. Combining with requests/selenium

# HTML fetched with requests
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')

# HTML rendered by selenium
soup = BeautifulSoup(driver.page_source, 'lxml')

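✅ Strengths

Concise, intuitive syntax (especially the CSS selectors)
Tolerant of errors (it can parse malformed HTML)
Supports several parsers (html.parser, lxml, html5lib)

✅ When to use it

Extracting titles, links, tables and similar data from HTML
Data cleaning and structuring
Alongside requests/selenium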
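
The first use case (titles, links, tables) can be sketched as follows. The HTML snippet, the `prices` table id, and `base_url` are all made up for illustration. The built-in `html.parser` is used so the example does not depend on lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/catalog/"  # hypothetical page address
html = """
<html><head><title>Price List</title></head>
<body>
  <table id="prices">
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Apple</td><td>3</td></tr>
    <tr><td>Pear</td><td>5</td></tr>
  </table>
  <a href="/page2">Next</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Title
title = soup.title.get_text(strip=True)

# Table -> list of dicts keyed by the header row
rows = soup.select("table#prices tr")
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    dict(zip(header, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]

# Links, resolved to absolute URLs
links = [urljoin(base_url, a["href"]) for a in soup.select("a[href]")]

print(title)
print(records)
print(links)
```

Cell values come back as strings, so any conversion to numbers is a separate cleaning step.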

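🕷️ 5. `Scrapy`: a full crawler framework

✅ Positioning

A full-featured crawling framework with enterprise-grade features such as concurrency, deduplication, middlewares, pipelines, and distributed crawling.

🔧 Installation

pip install scrapy

💡 Basic project layout

scrapy startproject myspider
cd myspider
scrapy genspider quotes quotes.toscrape.com

Generated directory: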

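myspider/
├── scrapy.cfg
└── myspider/
    ├── __init__.py
    ├── items.py        # data structure definitions
    ├── middlewares.py  # middlewares (request/response processing)
    ├── pipelines.py    # item pipelines (saving to a database, etc.)
    ├── settings.py     # configuration (UA, concurrency, delays, etc.)
    └── spiders/
        └── quotes.py   # spider logic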

💡 Core components by example

1. Defining an Item (the data model)

# items.py
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

2. Writing the Spider

# spiders/quotes.py
import scrapy
from myspider.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('a.tag::text').getall()
            yield item

        # Follow the link to the next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

3. Item pipeline (saving to JSON or a database)

# pipelines.py
import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        # One JSON object per line (JSON Lines)
        self.file = open('quotes.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
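
A pipeline class only runs once it is registered in the project settings. The fragment below is a sketch of what that could look like in `settings.py` for the `myspider` project above. The priority `300` and the User-Agent string are illustrative values, not recommendations.

```python
# settings.py (fragment)

# Register the pipeline; lower numbers run earlier (the usual range is 0-1000)
ITEM_PIPELINES = {
    "myspider.pipelines.JsonWriterPipeline": 300,
}

# Settings of the kind settings.py is used for (UA, concurrency, delay)
USER_AGENT = "MyBot/1.0"    # hypothetical bot name
ROBOTSTXT_OBEY = True       # honor robots.txt
CONCURRENT_REQUESTS = 16    # upper bound on parallel requests
DOWNLOAD_DELAY = 1          # seconds between requests to the same site
```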

4. Running the spider

scrapy crawl quotes
# Export the items to JSON
scrapy crawl quotes -o quotes.json

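✅ Advanced features

Automatic throttling: AUTOTHROTTLE_ENABLED = True
User-Agent rotation: through a middleware
Deduplication: built-in Request deduplication (fingerprint-based)
Concurrency control: CONCURRENT_REQUESTS = 16
Middlewares: modify requests/responses (add proxies, handle cookies, and so on)
Extensible: distributed crawling with Redis (Scrapy-Redis)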
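
To give a feel for what fingerprint-based deduplication means, here is a simplified standard-library sketch of the idea. It is not Scrapy's actual implementation, which is more involved. `fingerprint` and `SeenFilter` are hypothetical names invented for this example.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url):
    """Hash a canonical form of the request so equivalent URLs collide."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 are treated alike
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase the host and drop the fragment (it is never sent to the server)
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    return hashlib.sha1(f"{method.upper()} {canonical}".encode("utf-8")).hexdigest()


class SeenFilter:
    """Remember fingerprints and report whether a request is new."""

    def __init__(self):
        self._seen = set()

    def is_new(self, method, url):
        fp = fingerprint(method, url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


seen = SeenFilter()
first = seen.is_new("GET", "https://example.com/a?x=1&y=2")
duplicate = seen.is_new("get", "https://EXAMPLE.com/a?y=2&x=1#top")
other_method = seen.is_new("POST", "https://example.com/a?x=1&y=2")
print(first, duplicate, other_method)
```

Storing a fixed-size hash instead of the full URL keeps the memory cost per visited request small, which matters once a crawl covers many pages.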

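⚠️ Drawbacks

Steep learning curve
Overkill for simple scripts
Dynamic pages need Selenium on top (for example, through the scrapy-selenium plugin)

✅ When to use it

Large-scale data collection (tens of thousands of pages)
Crawlers that need long-term maintenance
Enterprise data collection systems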

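🔍 6. The five tools side by side

Tool	Type	Sends requests?	Parses HTML?	Runs JS?	Typical use
`urllib`	Standard-library HTTP client	✅	❌	❌	Simple requests, teaching
`requests`	Third-party HTTP library	✅	❌	❌	API calls, static page scraping
`selenium`	Browser automation	✅ (through the browser)	❌ (needs a parser)	✅	Dynamic pages, login, JS rendering
`BeautifulSoup`	HTML parser	❌	✅	❌	Data extraction, cleaning
`Scrapy`	Crawler framework	✅	✅ (built-in Selector)	❌ (needs a plugin)	Large-scale, structured crawling

💡 Typical combinations:

Static pages: requests + BeautifulSoup

Dynamic pages: selenium + BeautifulSoup

Large projects: Scrapy (selenium can be integrated for dynamic content)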

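✅ 7. Choosing a tool (one-line summaries)

Just need to send an HTTP request? → use requests (fall back to urllib only when third-party packages are off limits)
Page content loaded dynamically by JS? → use selenium
Pulling data out of HTML? → use BeautifulSoup (or Scrapy's response.css()/xpath())
Crawling tens of thousands of pages with deduplication, database storage, and automatic retries? → use Scrapy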

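🛡️ 8. Legal and ethical reminders

Respect robots.txt (urllib.robotparser can parse it)
Throttle your request rate (so the crawl does not hit the server like a denial-of-service attack)
Do not scrape private, paid, or sensitive data
Respect the site's copyright and terms of service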
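
The first two reminders can be checked in code. Below is a minimal sketch with `urllib.robotparser`. The robots.txt content is an in-memory example and `MyBot/1.0` is a hypothetical crawler name. A real crawler would load the file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (made up for this sketch)
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

agent = "MyBot/1.0"  # hypothetical crawler name

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask before fetching
allowed_public = rp.can_fetch(agent, "https://example.com/public/page")
allowed_private = rp.can_fetch(agent, "https://example.com/private/data")

# Use the site's requested delay, or fall back to a conservative default
delay = rp.crawl_delay(agent) or 1
print(allowed_public, allowed_private, delay)

# In the crawl loop: skip disallowed URLs and time.sleep(delay) between requests
```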

🌐 Staying legal and compliant is what keeps a crawler running in the long term.

Follow for resources

WeChat official account: 咚咚王
