Beautiful Soup - Extracting Email IDs

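Extracting email addresses from a web page is a common application of web scraping libraries such as BeautifulSoup. In a web page, email IDs usually appear in the href attribute of an anchor tag <a>, written using the mailto URL scheme. Often, an email address also appears in the page content as plain text, without any hyperlink. In this chapter, we use the BeautifulSoup library to retrieve email IDs from an HTML page with simple techniques.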

A typical use of an email ID in the href attribute looks like this:

<a href = "mailto:xyz@abc.com">test link</a>

In the first example, we consider the following HTML document, contact.html, and extract the email IDs from its hyperlinks:

<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
      <li><a href = "mailto:sales@company.com">Sales Enquiries</a></li>
      <li><a href = "mailto:careers@company.com">Careers</a></li>
      <li><a href = "mailto:partner@company.com">Partner with us</a></li>
      </ul>
   </body>
</html>

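The following Python code finds the email IDs. We collect all the <a> tags in the document and check whether each tag has an href attribute whose value starts with "mailto:". If it does, the email ID is the part of the value after those first 7 characters.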

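from bs4 import BeautifulSoup

# Parse the file; the with block closes it afterwards
with open("contact.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")

# Collect every <a> tag and keep those whose href starts with "mailto:"
tags = soup.find_all("a")
for tag in tags:
   if tag.has_attr("href") and tag['href'].startswith('mailto:'):
      # "mailto:" is 7 characters long, so the address starts at index 7
      print(tag['href'][7:])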

For the given HTML document, the email IDs are extracted as follows:

sales@company.com
careers@company.com
partner@company.com
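
Real-world mailto links often carry a query string (for example ?subject=...) or list several recipients, in which case slicing off the prefix is not enough. The following is a minimal sketch, not part of the original example, that selects the anchors with the CSS attribute selector a[href^="mailto:"] and separates the address from the query with urllib.parse. The inline HTML string and the helper name mailto_addresses are assumptions made for illustration.

```python
from urllib.parse import urlsplit, unquote

from bs4 import BeautifulSoup

# Hypothetical sample markup; the second link carries a subject query string.
html = """
<ul>
  <li><a href="mailto:sales@company.com">Sales Enquiries</a></li>
  <li><a href="mailto:careers@company.com?subject=Application">Careers</a></li>
  <li><a href="https://company.com/partners">Partner with us</a></li>
</ul>
"""

def mailto_addresses(markup):
    """Return the addresses found in mailto: links (illustrative helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    addresses = []
    # The ^= operator matches attributes whose value starts with the given text.
    for tag in soup.select('a[href^="mailto:"]'):
        # urlsplit separates the scheme, the address part (path) and the query.
        path = urlsplit(tag["href"]).path
        # A mailto URL may list several comma-separated recipients.
        for address in unquote(path).split(","):
            if address.strip():
                addresses.append(address.strip())
    return addresses

print(mailto_addresses(html))
# ['sales@company.com', 'careers@company.com']
```

The selector match is case-sensitive, so a link written as "MAILTO:" would need separate handling.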

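In the second example, we assume the email IDs can appear anywhere in the text. To extract them, we use a regular expression search. A regular expression is a pattern that describes a set of strings, and Python's re module provides functions for working with such patterns. The following pattern is used to search for email addresses: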

pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'
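
Before applying the pattern to a page, it can help to try it on a plain string. The sketch below does that with made-up sample text. The pattern is deliberately loose: its last character class includes the dot, so a full stop that immediately follows an address is captured as part of the match.

```python
import re

pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'

# Made-up sample text for illustration.
text = "Write to sales@company.com or first.last+news@mail.example.org for details"
print(re.findall(pat, text))
# ['sales@company.com', 'first.last+news@mail.example.org']

# A sentence-ending full stop is swallowed by the final [\w.-]+ class.
trailing = re.findall(pat, "Mail admin@company.com.")
print(trailing)
# ['admin@company.com.']

# One possible cleanup: strip trailing dots from each match.
print([m.rstrip(".") for m in trailing])
# ['admin@company.com']
```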

In this exercise, we use the following HTML document, in which the email IDs appear inside <li> tags.

<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
      <li>Sales Enquiries: sales@company.com</li>
      <li>Careers: careers@company.com</li>
      <li>Partner with us: partner@company.com</li>
      </ul>
   </body>
</html>

Using the email regular expression, we find the occurrences of the pattern in the text of each <li> tag. Here is the Python code:

Example

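from bs4 import BeautifulSoup
import re

def isemail(s):
   # Return every substring of s that matches the email pattern
   pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'
   return re.findall(pat, s)

with open("contact.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all('li')

for tag in tags:
   # get_text() always returns a string; tag.string is None when the tag has more than one child
   emails = isemail(tag.get_text())
   if emails:
      print(emails)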

Output

['sales@company.com']
['careers@company.com']
['partner@company.com']
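
If the addresses are not confined to <li> elements, the same pattern can be applied to the text of the whole page. The following is a sketch under that assumption, using get_text() on the parsed document and a hypothetical inline HTML string in place of contact.html. dict.fromkeys removes duplicate addresses while keeping the order in which they first appear.

```python
import re

from bs4 import BeautifulSoup

pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'

# Hypothetical markup: addresses in paragraphs, list items and nested tags.
html = """
<body>
  <h2>Contact Us</h2>
  <p>General questions: info@company.com</p>
  <ul>
    <li>Sales Enquiries: sales@company.com</li>
    <li>Careers: <b>careers@company.com</b></li>
  </ul>
  <p>Or write to info@company.com again</p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
# get_text() joins every text node; the " " separator keeps
# text from neighbouring tags from running together.
page_text = soup.get_text(" ")
# Deduplicate while preserving first-seen order.
emails = list(dict.fromkeys(re.findall(pat, page_text)))
print(emails)
# ['info@company.com', 'sales@company.com', 'careers@company.com']
```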

Using the simple techniques above, we can extract email IDs from web pages with BeautifulSoup.
