Python Program to Extract Strings Between HTML Tags

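HTML tags define the skeleton of a web page. Content is delivered as strings enclosed in those tags, and the tags tell the browser how to interpret and display each element. Extracting the enclosed strings is therefore a common step in data processing and manipulation, and it also helps when analyzing the structure of an HTML document.

These strings reveal the patterns and logic behind how a page is built. In this article, our task is to extract the strings enclosed in HTML tags using Python.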

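Understanding the Problem

We have to extract all the strings that lie between HTML tags. The target strings are enclosed in different types of tags, and only the content should be retrieved, not the tags themselves. Let us understand this with an example.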

Input-Output Scenario

Consider the following string:

Input:
Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"

The input string contains several HTML tags, and we have to extract the strings between them.

Output: [" This is a test string,  Let's code together "]

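As we can see, the "<h1>" and "<p>" tags and their closing tags have been removed, and the strings have been extracted. Now that we understand the problem, let us discuss some solutions.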

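Using Iteration and replace()

This approach removes the HTML tags by replacing them. We start with the input string and a list of the HTML tags to remove, and we store the string as the only element of a list.

We iterate over every element of the tag list and check whether it occurs in the string. A variable "pos" holds the index of the string in that list, which is 0 here because the list has a single element.

We use the "replace()" method to replace each tag with a space, which leaves a string free of the listed HTML tags.

Example

The following example extracts the strings between HTML tags: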

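Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
# Every opening and closing tag to strip must be listed explicitly
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
print(f"This is the original string: {Inp_STR}")
ExStr = [Inp_STR]   # the string is stored as the only element of a list
pos = 0             # index of that element; it never changes

for tag in tags:
    if tag in ExStr[pos]:
        # Replace each occurrence of the tag with a single space
        ExStr[pos] = ExStr[pos].replace(tag, " ")

print(f"The extracted string is : {ExStr}")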

Output

This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is : [" This is a test string,  Let's code together "]
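The result above still contains the leading, trailing, and doubled spaces left behind by the replaced tags. As a minimal sketch, the same replace() loop can be wrapped in a function that also collapses the extra whitespace. The function name strip_tags and the whitespace cleanup step are assumptions added for illustration; they are not part of the original example.

```python
def strip_tags(text, tags):
    """Replace each listed tag with a space, then collapse repeated whitespace.

    strip_tags is a hypothetical helper name used only for this sketch.
    """
    for tag in tags:
        text = text.replace(tag, " ")
    # split() with no arguments splits on any run of whitespace,
    # so joining with a single space removes the extra gaps.
    return " ".join(text.split())


sample = "<h1>This is a test string,</h1><p>Let's code together</p>"
tag_list = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
print(strip_tags(sample, tag_list))
# prints: This is a test string, Let's code together
```

Note that this approach merges all the text into one string, so the boundaries between the original elements are lost.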

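Using the Regular Expression Module and findall()

In this approach, we use the regular expression module "re" to match a specific pattern. We build the regular expression "<"+tag+">(.*?)</"+tag+">", which matches an opening tag, the text that follows it, and the corresponding closing tag; the group "(.*?)" captures only the text in between. Here, "tag" is a variable that takes each value from the tag list in turn.

The "findall()" function finds all matches of the pattern in the original string. Because the pattern contains one capturing group, it returns the captured text rather than the whole match. We use the "extend()" method to add all the matches to a new list. In this way, we extract the strings enclosed in HTML tags.

Example

The following is an example: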
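The "?" in "(.*?)" makes the group non-greedy, which matters as soon as the same tag appears more than once. The short sketch below compares the greedy and non-greedy forms on a made-up string; the variable names are chosen only for illustration.

```python
import re

text = "<p>one</p><p>two</p>"

# Greedy: ".*" runs to the LAST "</p>", swallowing the tags in between.
greedy = re.findall("<p>(.*)</p>", text)

# Non-greedy: ".*?" stops at the FIRST "</p>" after each "<p>".
lazy = re.findall("<p>(.*?)</p>", text)

print(greedy)  # ['one</p><p>two']
print(lazy)    # ['one', 'two']
```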

import re
Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
# Tag names only; "br" has no closing tag, so its pattern never matches
tags = ["h1", "p", "b", "br"]
print(f"This is the original string: {Inp_STR}")
ExStr = []

for tag in tags:
    # Opening tag, non-greedy captured content, closing tag
    seq = "<"+tag+">(.*?)</"+tag+">"
    matches = re.findall(seq, Inp_STR)   # list of captured strings
    ExStr.extend(matches)
print(f"The extracted string is: {ExStr}")

Output

This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
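The pattern above only matches tags written exactly as "<p>"; it would miss tags that carry attributes, use upper case, or wrap text that spans several lines. The following is a rough sketch of how the pattern could be loosened, assuming the markup is simple and well formed. The helper name extract_tag_text is hypothetical, and a regular expression like this is not a substitute for a real HTML parser when the input is arbitrary.

```python
import re


def extract_tag_text(html, tag):
    """Return the text inside every <tag ...>...</tag> pair (hypothetical helper).

    Assumes attribute values do not contain ">" and tags of the same
    name are not nested.
    """
    name = re.escape(tag)
    # \b keeps "p" from matching "<pre>"; [^>]* skips any attributes.
    pattern = rf"<{name}\b[^>]*>(.*?)</{name}\s*>"
    # DOTALL lets "." cross line breaks; IGNORECASE accepts <P> as well as <p>.
    return re.findall(pattern, html, flags=re.DOTALL | re.IGNORECASE)


html = '<p class="intro">Hello\nworld</p><P>Again</P>'
print(extract_tag_text(html, "p"))
# prints: ['Hello\nworld', 'Again']
```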

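Using Iteration and find()

In this approach, we use the "find()" method to locate the first occurrence of the opening and closing tags in the original string. We iterate over every element of the tag list and retrieve its position in the string.

A while loop keeps searching the string for further occurrences of the tag. A condition checks for an incomplete tag, that is, an opening tag with no matching closing tag, and stops the search for that tag when it finds one. On each iteration, the index values are updated to locate the next occurrence of the opening and closing tags.

Each time a pair of opening and closing tags is located, we use string slicing with their index values to extract the string between them and append it to the result list.

Example

The following is an example: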
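To make the index arithmetic concrete, here is a small walk-through for a single tag. The variable names are chosen for illustration only; the key point is that the content starts len(tag) + 2 characters after the opening tag's index, because "<" and ">" add two characters to the tag name.

```python
text = "<h1>This is a test string,</h1>"
tag = "h1"

start = text.find("<" + tag + ">")         # index where the opening tag begins
end = text.find("</" + tag + ">", start)   # index where the closing tag begins
content_start = start + len(tag) + 2       # skip "<", the tag name, and ">"

print(start, end, content_start)  # 0 26 4
print(text[content_start:end])    # This is a test string,
```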

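Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]
ExStr = []
print(f"The original string is: {Inp_STR}")

for tag in tags:
    # Position of the first opening tag, or -1 if it does not occur
    tagpos1 = Inp_STR.find("<"+tag+">")
    while tagpos1 != -1:
        # Look for the closing tag after the opening tag
        tagpos2 = Inp_STR.find("</"+tag+">", tagpos1)
        if tagpos2 == -1:
            break   # incomplete tag: no closing tag was found
        # len(tag)+2 skips "<", the tag name, and ">"
        ExStr.append(Inp_STR[tagpos1 + len(tag)+2: tagpos2])
        # Continue with the next opening tag of the same kind
        tagpos1 = Inp_STR.find("<"+tag+">", tagpos2)

print(f"The extracted string is: {ExStr}")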

Output

The original string is: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
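Because the outer loop runs over the tag list, the results are grouped by tag rather than by where they appear in the document. If document order matters, one possible variation is to record each match together with its position and sort at the end. This is only a sketch under the same assumptions as the example above (exact tag spelling, no nesting); the function name extract_in_document_order is hypothetical.

```python
def extract_in_document_order(text, tags):
    """Collect strings between tags, ordered by position in the text."""
    found = []  # (start index, extracted string) pairs
    for tag in tags:
        open_tag, close_tag = f"<{tag}>", f"</{tag}>"
        start = text.find(open_tag)
        while start != -1:
            end = text.find(close_tag, start)
            if end == -1:
                break  # unmatched opening tag: stop searching for this tag
            found.append((start, text[start + len(open_tag):end]))
            start = text.find(open_tag, end)
    # Sorting by start index restores the order used in the document.
    return [chunk for _, chunk in sorted(found)]


sample = "<p>first</p><h1>second</h1><p>third</p>"
print(extract_in_document_order(sample, ["h1", "p"]))
# prints: ['first', 'second', 'third']
```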

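Conclusion

In this article, we discussed several ways to extract the strings between HTML tags. We began with the simplest solution, finding the tags and replacing them with spaces. We then used the regular expression module and its findall() function to find matches of a pattern. Finally, we applied the find() method together with string slicing.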

Devesh Chauhan

Updated on: July 12, 2023
