How to Use XPath in BeautifulSoup?

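XPath is a powerful query language for navigating XML and HTML documents and extracting information from them. BeautifulSoup is a Python library that provides simple methods for parsing and manipulating HTML and XML documents. BeautifulSoup does not evaluate XPath expressions itself: its search API consists of find(), find_all(), and the CSS-selector methods select_one() and select(). Most XPath queries can be rewritten with those methods, and genuine XPath expressions can be evaluated with the lxml library, which BeautifulSoup can also use as its parser. Combining the two can greatly improve your web scraping and data extraction tasks. In this article, we look at how to work effectively with XPath-style queries in BeautifulSoup.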

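Algorithm for Using XPath in BeautifulSoup

The general algorithm for running an XPath-style query in BeautifulSoup is:

Load the HTML document into BeautifulSoup with a suitable parser.
Rewrite the XPath expression as a find(), find_all(), select_one(), or select() call.
Pass the tag name or CSS selector as a string, along with any required attributes or conditions.
Retrieve the required elements or information from the HTML document.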
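
The four steps above can be sketched as follows. This is a minimal illustration, not the only way to do it: the markup and the variable names are hypothetical, and the built-in html.parser is used here only to keep the sketch free of extra dependencies. The XPath being translated is //div[@id='content']//li.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = "<div id='content'><ul><li>A</li><li>B</li></ul></div>"

# Step 1: load the document with a parser.
soup = BeautifulSoup(html, "html.parser")

# Steps 2-3: rewrite //div[@id='content']//li as a CSS selector...
css_items = soup.select("div#content li")

# ...or as a find() call followed by find_all().
container = soup.find("div", attrs={"id": "content"})
found_items = container.find_all("li")

# Step 4: retrieve the information, here the text of each item.
css_texts = [li.get_text() for li in css_items]
found_texts = [li.get_text() for li in found_items]
print(css_texts)
print(found_texts)
```

Both approaches return the same elements in document order, so the choice between them is mostly a matter of readability.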

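Installing the Required Libraries

Before working with XPath, make sure the BeautifulSoup and lxml libraries are installed. You can install them with the following pip command: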

pip install beautifulsoup4 lxml

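Loading the HTML Document

Let us load an HTML document into BeautifulSoup. This document serves as the basis for our examples. Suppose we have the following HTML structure: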

<html>
  <body>
    <div id="content">
      <h1>Welcome to My Website</h1>
      <p>Some text here...</p>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
      </ul>
    </div>
  </body>
</html>

We can load the HTML above into BeautifulSoup with the following code:

from bs4 import BeautifulSoup

html_doc = '''
<html>
  <body>
    <div id="content">
      <h1>Welcome to My Website</h1>
      <p>Some text here...</p>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
      </ul>
    </div>
  </body>
</html>
'''

soup = BeautifulSoup(html_doc, 'lxml')

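Basic XPath Syntax

XPath uses a path-like syntax to locate elements in an XML or HTML document. Here are some basic XPath syntax elements:

Element selection
- Select elements by tag name: //tag_name
- Select elements by attribute value: //*[@attribute_name='value']
- Select elements that have a given attribute: //*[@attribute_name]
- Select elements by class name: //*[contains(@class, 'class_name')] (this is a substring test, so it also matches longer class names that contain class_name)
Relative paths
- Select direct children of an element: //parent_tag/child_tag
- Select descendants at any depth: //ancestor_tag//child_tag
Predicates
- Select the element at a given index (XPath indexes start at 1): (//tag_name)[index]
- Select elements with a given attribute value: //tag_name[@attribute_name='value']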
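
To see these expressions in action, they can be evaluated with lxml, whose elements expose an xpath() method. The sketch below assumes lxml is installed and reuses the article's sample HTML; the variable names are illustrative.

```python
from lxml import html

# Parse the sample document directly with lxml.
doc = html.fromstring('''
<html>
  <body>
    <div id="content">
      <h1>Welcome to My Website</h1>
      <p>Some text here...</p>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
      </ul>
    </div>
  </body>
</html>
''')

# //tag_name: every <h1> in the document
headings = [h.text for h in doc.xpath("//h1")]

# //*[@attribute_name='value']: any element whose id is "content"
by_id = doc.xpath("//*[@id='content']")

# //parent_tag/child_tag: <li> elements that are direct children of a <ul>
direct = [li.text for li in doc.xpath("//ul/li")]

# //ancestor_tag//child_tag: <li> elements anywhere below a <div>
nested = [li.text for li in doc.xpath("//div//li")]

# (//tag_name)[index]: the second <li>, counting from 1
second = doc.xpath("(//li)[2]")[0].text

print(headings, by_id[0].tag, direct, nested, second)
```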

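XPath-Style Queries with BeautifulSoup Methods

Method 1: find() and find_all()

The find() method returns the first matching element (or None if nothing matches), and the find_all() method returns a list of all matching elements. In XPath terms, find('h1') plays the role of (//h1)[1] and find_all('li') plays the role of //li.

Example

In the example below, we use find() to locate the first <h1> tag in the HTML document and print its text content. find_all() is then used to find every <li> tag in the document, and a loop prints the text content of each one.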

from bs4 import BeautifulSoup

# Loading the HTML Document
html_doc = '''
<html>
  <body>
    <div id="content">
      <h1>Welcome to My Website</h1>
      <p>Some text here...</p>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
      </ul>
    </div>
  </body>
</html>
'''

# Creating a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')

# Using find() and find_all()
result = soup.find('h1')
print(result.text)  # Output: Welcome to My Website

results = soup.find_all('li')
for li in results:
    print(li.text)

Output

Welcome to My Website
Item 1
Item 2
Item 3

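Method 2: select_one() and select()

The select_one() method returns the first matching element, and the select() method returns a list of all matching elements. Both take CSS selectors rather than XPath expressions; for example, the XPath //*[@id='content'] corresponds to the CSS selector #content.

Example

In the example below, we use select_one() to select the element with the ID content (that is, <div id="content">) and assign it to the result variable. Printing the text content of this element outputs all the text nested inside the <div>: the heading, the paragraph, and the three list items, not just "Welcome to My Website". Next, select() selects all <li> elements in the HTML document and assigns them to the results variable. A loop then iterates over each <li> element and prints its text content.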

from bs4 import BeautifulSoup

# Loading the HTML Document
html_doc = '''
<html>
  <body>
    <div id="content">
      <h1>Welcome to My Website</h1>
      <p>Some text here...</p>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
      </ul>
    </div>
  </body>
</html>
'''

# Creating a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')

# Using select_one() and select()
result = soup.select_one('#content')
print(result.text)  # Prints all text inside <div id="content">, not only the heading

results = soup.select('li')
for li in results:
    print(li.text)

Output

Welcome to My Website
Some text here...

Item 1
Item 2
Item 3


Item 1
Item 2
Item 3

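Method 3: XPath-Style Attribute Filters with find() and find_all()

find() and find_all() do not accept XPath expressions as strings. An attribute predicate such as //li[@class='active'] is instead expressed by passing the tag name together with an attrs dictionary.

Example

In the example below, we use find() to locate the first <li> element whose class attribute is set to 'active', roughly the counterpart of the XPath (//li[@class='active'])[1]. The result is assigned to the result variable and printed. If such an element exists, it is printed; otherwise None is shown, which is the case for our sample document. Next, find_all() is used to find all <div> elements whose id attribute is set to 'content', the counterpart of //div[@id='content']. The matches are stored in the results variable, and a loop iterates over each <div> element and prints its text content.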

from bs4 import BeautifulSoup

# Loading the HTML Document
html_doc = '''
<html>
  <body>
    <div id="content">
      <h1>Welcome to My Website</h1>
      <p>Some text here...</p>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
      </ul>
    </div>
  </body>
</html>
'''

# Creating a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')

# BeautifulSoup counterparts of //li[@class='active'] and //div[@id='content']
result = soup.find('li', attrs={'class': 'active'})
print(result)  # None: no <li> in the sample has class "active"

results = soup.find_all('div', attrs={'id': 'content'})
for div in results:
    print(div.text)

Output

None

Welcome to My Website
Some text here...

Item 1
Item 2
Item 3
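
When a genuine XPath expression is needed on a document that is already held in a BeautifulSoup object, one possible approach is to serialize the soup and hand it to lxml. The sketch below assumes lxml is installed; the shortened markup and the variable names are illustrative, and re-parsing costs an extra pass over the document.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Shortened version of the article's sample markup.
html_doc = (
    '<div id="content"><h1>Welcome to My Website</h1>'
    '<ul><li>Item 1</li><li>Item 2</li></ul></div>'
)

soup = BeautifulSoup(html_doc, 'lxml')

# Serialize the soup back to a string and parse it into an lxml tree,
# which can evaluate XPath 1.0 expressions.
tree = etree.HTML(str(soup))

# Same queries as in the BeautifulSoup example, written as XPath.
active = tree.xpath("//li[@class='active']")
item_texts = tree.xpath("//div[@id='content']//li/text()")

print(active)      # empty list: nothing has class "active"
print(item_texts)
```

Note that xpath() returns an empty list when nothing matches, whereas find() returns None.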

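Advanced XPath Expressions

XPath provides advanced expressions for handling complex queries. Here are some examples:

Selecting elements by text content
- Select elements by exact text match: //tag_name[text()='value']
- Select elements by partial text match: //tag_name[contains(text(), 'value')]
Selecting elements by position
- Select the first element: (//tag_name)[1]
- Select the last element: (//tag_name)[last()]
- Select every element from the second one onward: (//tag_name)[position() > 1]
Selecting elements by attribute value
- Select elements whose attribute starts with a given value: //tag_name[starts-with(@attribute_name, 'value')]
- Select elements whose attribute ends with a given value: //tag_name[ends-with(@attribute_name, 'value')] (ends-with() is an XPath 2.0 function; XPath 1.0 engines such as lxml do not provide it)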
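
The following sketch runs these expressions with lxml, assuming it is installed. The class names in the markup are hypothetical and exist only to exercise the attribute tests. Because lxml implements XPath 1.0, the ends-with() case is emulated here with substring() and string-length().

```python
from lxml import html

# Hypothetical markup; the class names are made up for this illustration.
doc = html.fromstring(
    "<html><body><ul>"
    "<li class='item-first'>Item 1</li>"
    "<li class='item-mid'>Item 2</li>"
    "<li class='last-item'>Item 3</li>"
    "</ul></body></html>"
)

def texts(expr):
    """Return the text of every element matched by an XPath expression."""
    return [el.text for el in doc.xpath(expr)]

exact = texts("//li[text()='Item 2']")               # exact text match
partial = texts("//li[contains(text(), 'Item')]")    # partial text match
first = texts("(//li)[1]")                           # first element
last = texts("(//li)[last()]")                       # last element
rest = texts("(//li)[position() > 1]")               # second element onward
starts = texts("//li[starts-with(@class, 'item-')]")

# XPath 1.0 stand-in for ends-with(@class, '-item'):
# compare the tail of the attribute with the suffix.
ends = texts(
    "//li[substring(@class, "
    "string-length(@class) - string-length('-item') + 1) = '-item']"
)

print(exact, partial, first, last, rest, starts, ends)
```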

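Conclusion

In this article, we saw how XPath-style queries can be used together with BeautifulSoup to extract data from complex HTML structures. XPath is a powerful tool for navigating XML and HTML documents, while BeautifulSoup simplifies parsing and manipulating those documents in Python. BeautifulSoup's find(), find_all(), select_one(), and select() methods cover most XPath-style lookups, and lxml can evaluate XPath expressions directly when they are needed.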

Rohan Singh

Updated on: October 16, 2023
