Beautiful Soup - Get Text Inside a Tag

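HTML has two kinds of tags. Most come in pairs: an opening tag and a closing tag. The top-level <html> tag with its closing </html> is the main example; others include <body> and </body>, <p> and </p>, <h1> and </h1>, and so on. The remaining tags are self-closing (void) tags, such as <img> and <br>. Unlike paired tags such as <b>Hello</b>, a self-closing tag has no closing tag and therefore cannot contain text. In this chapter, we will see how to use the Beautiful Soup library to get the text inside a tag.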

Beautiful Soup provides several methods and properties for getting the text associated with a Tag object.

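Sr.No.	Method/Property and Description
1	text property: Returns all child strings of a PageElement concatenated together; equivalent to calling get_text() with no arguments.
2	string property: Convenience property for a tag's single child string; it is None when the tag has more than one child.
3	strings property: Generator that yields every string found beneath the current PageElement.
4	stripped_strings property: Same as strings, but leading and trailing whitespace is stripped from each string and whitespace-only strings are skipped.
5	get_text() method: Returns all child strings of this PageElement, joined with the separator if one is given.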
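
To make the differences between these five accessors concrete, here is a minimal sketch that applies each of them to one small paragraph. The one-line HTML snippet and the variable names are assumptions chosen for illustration; they are not part of the tutorial's sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a <p> holding one string and one nested <b> tag
soup = BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser")
p = soup.p

text = p.text                          # all strings concatenated
single = p.string                      # None, because <p> has two children
inner = p.b.string                     # <b> has exactly one child string
parts = list(p.strings)                # every string, whitespace kept
clean = list(p.stripped_strings)       # every string, whitespace stripped
joined = p.get_text("|", strip=True)   # stripped strings joined by "|"

print(repr(text))    # 'Hello World'
print(single)        # None
print(inner)         # World
print(parts)         # ['Hello ', 'World']
print(clean)         # ['Hello', 'World']
print(joined)        # Hello|World
```

Note that text takes no arguments, so a separator or stripping can only be requested through get_text().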

Consider the following HTML document:

<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>

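If we retrieve the stripped_strings property of each tag in the parsed document tree, we find that the two div tags and the p tag each yield two NavigableString objects, Hello and World. The <b> tag contains only the World string, while <img> has no text at all.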
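
The string property behaves differently from stripped_strings on this document, which is easy to overlook. The sketch below, which assumes the same sample HTML and uses illustrative variable names, checks both on the p, b and img tags:

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

b_string = soup.b.string        # <b> has a single child string
p_string = soup.p.string        # None: <p> has two children (a string and <b>)
img_string = soup.img.string    # None: <img> has no children at all

p_parts = list(soup.p.stripped_strings)      # strings found anywhere under <p>
img_parts = list(soup.img.stripped_strings)  # nothing to yield for <img>

print(b_string, p_string, img_string)   # World None None
print(p_parts, img_parts)               # ['Hello', 'World'] []
```

In other words, string only helps when a tag wraps exactly one string, whereas stripped_strings walks the whole subtree.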

The following example gets the text from each tag in the given HTML document:

Example

html = """
<div id="outer">
   <div id="inner">
      <p>Hello<b>World</b></p>
      <img src='logo.jpg'>
   </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
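# find_all() with no arguments returns every tag in the document
for tag in soup.find_all():
   print("Tag: {} attributes: {} ".format(tag.name, tag.attrs))
   # stripped_strings yields each string under the tag with whitespace removed
   for txt in tag.stripped_strings:
      print(txt)

   print()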

Output

Tag: div attributes: {'id': 'outer'} 
Hello
World

Tag: div attributes: {'id': 'inner'} 
Hello
World

Tag: p attributes: {} 
Hello
World

Tag: b attributes: {} 
World

Tag: img attributes: {'src': 'logo.jpg'}
