Extracting Attribute Values with Beautiful Soup in Python

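Beautiful Soup is a Python library for parsing HTML and XML documents. It provides several methods for searching and navigating the parse tree, which makes it easy to pull data out of a document. To extract an attribute value, we parse the HTML document, locate the element that carries the attribute, and read the value from it. This article shows how to do that with Beautiful Soup in Python.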

Algorithm

You can follow the steps below to extract an attribute value with Beautiful Soup in Python.

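Parse the HTML document with the BeautifulSoup class from the bs4 library.
Find the HTML element that contains the attribute you want, using an appropriate BeautifulSoup method such as find() or find_all().
Check whether the attribute exists on the element, using a conditional statement or the has_attr() method.
If the attribute exists, extract its value using square brackets ([]) with the attribute name as the key.
If the attribute does not exist, handle the missing value accordingly, since square-bracket access on a missing attribute raises a KeyError.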
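
The steps above can be sketched as a small helper. This is a minimal sketch rather than a definitive implementation: the function name extract_attribute, the sample HTML, and the decision to return None when the tag or attribute is missing are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def extract_attribute(html, tag_name, attr_name):
    """Return the value of attr_name on the first tag_name element, or None.

    Hypothetical helper for illustration; it is not part of Beautiful Soup.
    """
    # Step 1: parse the HTML document.
    soup = BeautifulSoup(html, "html.parser")

    # Step 2: locate the element; find() returns None when nothing matches.
    tag = soup.find(tag_name)
    if tag is None:
        return None

    # Step 3: check that the attribute exists before reading it.
    if tag.has_attr(attr_name):
        # Step 4: read the value with square brackets.
        return tag[attr_name]

    # Step 5: handle the missing attribute (here, by returning None).
    return None


# Assumed sample markup: the img tag deliberately has no src attribute.
sample_html = '<p><a href="https://www.google.com">Google</a><img alt="logo"></p>'

link = extract_attribute(sample_html, "a", "href")
missing = extract_attribute(sample_html, "img", "src")
print(link)
print(missing)
```

Returning None keeps the caller's code simple; raising a custom exception or returning a default value would be equally reasonable choices depending on the application.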

Installing Beautiful Soup

Before using the Beautiful Soup library, you need to install it with pip, the Python package manager. To install Beautiful Soup, type the following command in a terminal or command prompt.

pip install beautifulsoup4

Extracting Attribute Values

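To extract an attribute value from an HTML tag, we first parse the HTML document with BeautifulSoup. We then use Beautiful Soup's search methods to locate a specific tag in the document and read its attribute value.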

Example 1: Extracting the href attribute with find() and square brackets

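In the example below, we first create an HTML document and pass it as a string to the BeautifulSoup constructor, specifying html.parser as the parser. Next, we use the find() method of the soup object to look up the "a" tag, which returns the first "a" tag in the document. Finally, we use square-bracket notation to read the href attribute from that tag. The value is returned as a string.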

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <a href="https://www.google.com">Google</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the 'a' tag
a_tag = soup.find('a')

# Extract the value of the 'href' attribute
href_value = a_tag['href']

print(href_value)

Output

https://www.google.com
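
Square brackets work well when the attribute is known to exist. When it may be absent, the tag's get() method is a common alternative. The following is a minimal sketch; the sample markup, including the class attribute and the link without an href, is assumed for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: the second link deliberately has no href.
html_doc = (
    '<a href="https://www.google.com" class="external link">Google</a>'
    "<a>No Href</a>"
)
soup = BeautifulSoup(html_doc, "html.parser")

first, second = soup.find_all("a")

# get() returns None (or a supplied default) instead of raising KeyError.
print(first.get("href"))
print(second.get("href"))
print(second.get("href", "#"))

# Square brackets raise KeyError when the attribute is absent.
try:
    second["href"]
except KeyError:
    print("no href on the second link")

# attrs exposes every attribute of a tag as a dictionary.
# Multi-valued attributes such as class come back as a list.
print(first.attrs)
```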

Example 2: Finding elements that have a specific attribute with attrs

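In the example below, we use the find_all() method to find every "a" tag that has an href attribute. The attrs argument specifies the attributes to match on. {'href': True} matches elements that have an href attribute with any value, so the third link, which has no href, is skipped.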

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <a href="https://www.google.com">Google</a>
   <a href="https://python.club.tw">Python</a>
   <a>No Href</a>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all 'a' tags with an 'href' attribute
a_tags_with_href = soup.find_all('a', attrs={'href': True})
for tag in a_tags_with_href:
    print(tag['href'])

Output

https://www.google.com
https://python.club.tw
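
The same filter can be written in other ways. The sketch below, which reuses the links from Example 2 in trimmed-down markup, compares the keyword-argument form href=True with the CSS attribute selector a[href]; the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="https://www.google.com">Google</a>
<a href="https://python.club.tw">Python</a>
<a>No Href</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Keyword form: href=True matches any 'a' tag that has an href attribute.
hrefs_keyword = [tag["href"] for tag in soup.find_all("a", href=True)]

# CSS form: the attribute selector a[href] expresses the same condition.
hrefs_css = [tag["href"] for tag in soup.select("a[href]")]

print(hrefs_keyword)
print(hrefs_css)
```

Both lists contain the two URLs and omit the link without an href, so the choice between the forms is mostly a matter of style.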

Example 3: Finding all occurrences of an element with find_all()

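Sometimes you want to find every occurrence of an HTML element on a page. The find_all() method does this. In the example below, we use find_all() to find all div tags with the class container. We then loop over each div tag and look up the h1 and p tags inside it.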

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <div class="container">
      <h1>Heading 1</h1>
      <p>Paragraph 1</p>
   </div>
   <div class="container">
      <h1>Heading 2</h1>
      <p>Paragraph 2</p>
   </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all 'div' tags with class='container'
div_tags = soup.find_all('div', class_='container')
for div in div_tags:
    h1 = div.find('h1')
    p = div.find('p')
    print(h1.text, p.text)

Output

Heading 1 Paragraph 1
Heading 2 Paragraph 2
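
The same loop can also read attribute values from each matched element. The following is a minimal sketch; the id and data-order attributes and the extra featured class are assumed additions to the markup, included only to show how different kinds of attributes are returned.

```python
from bs4 import BeautifulSoup

# Assumed markup: id, data-order, and the "featured" class are illustrative.
html_doc = """
<div class="container" id="first" data-order="1"><h1>Heading 1</h1></div>
<div class="container featured" id="second"><h1>Heading 2</h1></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

rows = []
for div in soup.find_all("div", class_="container"):
    # id is single-valued, class is returned as a list of class names,
    # and get() supplies a default when data-order is missing.
    rows.append((div["id"], div["class"], div.get("data-order", "n/a")))

for row in rows:
    print(row)
```

Note that class_='container' matches the second div even though it carries two classes, because the filter matches against each individual class name.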

Example 4: Finding elements with a CSS selector using select()

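In the example below, we use the select() method to find all h1 tags inside div tags that have the class container. The CSS selector 'div.container h1' expresses this: the dot (.) introduces a class name, and the space is the descendant combinator.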

from bs4 import BeautifulSoup

# Parse the HTML document
html_doc = """
<html>
<body>
   <div class="container">
      <h1>Heading 1</h1>
      <p>Paragraph 1</p>
   </div>
   <div class="container">
      <h1>Heading 2</h1>
      <p>Paragraph 2</p>
   </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all 'h1' tags inside a 'div' tag with class='container'
h1_tags = soup.select('div.container h1')
for h1 in h1_tags:
    print(h1.text)

Output

Heading 1
Heading 2
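
CSS selectors can also filter on attributes, which ties select() back to attribute extraction. The following is a minimal sketch; the links inside each container are assumed additions to the markup, made for illustration.

```python
from bs4 import BeautifulSoup

# Assumed markup: the links inside each container are added for illustration.
html_doc = """
<div class="container">
   <h1>Heading 1</h1>
   <a href="https://www.google.com">Google</a>
</div>
<div class="container">
   <h1>Heading 2</h1>
   <a href="/local/page">Local</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# All links that are descendants of div.container and have an href.
all_links = [a["href"] for a in soup.select("div.container a[href]")]

# Only links whose href starts with "https" (the ^= prefix operator).
https_links = [a["href"] for a in soup.select('div.container a[href^="https"]')]

# select_one() returns the first match, or None when nothing matches.
first_link = soup.select_one("div.container a[href]")

print(all_links)
print(https_links)
print(first_link["href"])
```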

Conclusion

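In this article, we discussed how to extract attribute values from an HTML document with the Beautiful Soup library in Python. Using methods such as find(), find_all(), and select(), together with square-bracket access on the resulting tags, we can extract the data we need from HTML and XML documents.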

Rohan Singh

Updated on: July 10, 2023
