HTML support in Python


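Python can process HTML files with the HTMLParser class in the html.parser module. The parser reports each tag as it is encountered, together with its position in the source (line number and offset) and its attributes. It also provides handler methods for identifying and extracting the data in an HTML file.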
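As a small illustration of the attribute and position reporting described above, the sketch below collects the href attribute of every anchor tag along with the position at which the tag was found. The class name LinkCollector and the sample markup are hypothetical and exist only for this illustration; the only library calls used are HTMLParser.feed, handle_starttag, and getpos.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records (href, position) for each <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    # getpos() returns (line number, offset) of the tag
                    self.links.append((value, self.getpos()))


# Made-up sample markup spanning two lines
sample = '<p><a href="https://example.com">one</a>\n<a href="/two">two</a></p>'

collector = LinkCollector()
collector.feed(sample)

for href, position in collector.links:
    print(href, position)
```

Line numbers start at 1 and offsets start at 0, so a tag at the very beginning of the input is reported at (1, 0).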

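The example in this article shows how to subclass HTMLParser to build a custom parser class. Such a class reacts only to the events whose handler methods it overrides; here those are start tags, end tags, and data.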
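To make the point about overridden handlers concrete, here is a minimal sketch, assuming two hypothetical classes named TagsOnly and TagsAndComments. The first overrides only handle_starttag, so comments pass by unnoticed; the second also overrides handle_comment and therefore sees them.

```python
from html.parser import HTMLParser


class TagsOnly(HTMLParser):
    """Hypothetical parser that only overrides the start-tag handler."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(tag)


class TagsAndComments(TagsOnly):
    """Same as TagsOnly, but also overrides the comment handler."""

    def handle_comment(self, data):
        self.seen.append("comment:" + data.strip())


# Made-up snippet containing one comment
snippet = "<div><!-- a note --><p>text</p></div>"

tags_only = TagsOnly()
tags_only.feed(snippet)

tags_and_comments = TagsAndComments()
tags_and_comments.feed(snippet)

print(tags_only.seen)
print(tags_and_comments.seen)
```

The default handlers in HTMLParser do nothing, which is why the comment is ignored by the first class rather than raising an error.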

The HTML below is the input file that the custom parser (Custom_Parser, defined in the program that follows it) processes.

Example

<html>
<head>
<title>welcome to Tutorials Point!</title>
</head>
<body>
<h1>Learn anything !</h1>
</body>
</html>

The following program parses the file above and prints the results produced by the custom parser.

Example

from html.parser import HTMLParser

class Custom_Parser(HTMLParser):
   # Called for every opening tag, e.g. <html>
   def handle_starttag(self, tag, attrs):
      print("Line and Offset ==", self.getpos())
      print("Encountered a start tag:", tag)

   # Called for every closing tag, e.g. </html>
   def handle_endtag(self, tag):
      print("Line and Offset ==", self.getpos())
      print("Encountered an end tag :", tag)

   # Called for the text between tags, including bare newlines
   def handle_data(self, data):
      print("Line and Offset ==", self.getpos())
      print("Encountered some data :", data)

parser = Custom_Parser()

# Raw string, so that "\t" in the Windows path is not read as a tab character
with open(r"E:\test.html", "r") as stream:
   parser.feed(stream.read())

Output

Running the above code gives the following result:

Line and Offset == (1, 0)
Encountered a start tag: html
Line and Offset == (1, 6)
Encountered some data :

Line and Offset == (2, 0)
Encountered a start tag: head
Line and Offset == (2, 6)
Encountered some data :

Line and Offset == (3, 0)
Encountered a start tag: title
Line and Offset == (3, 7)
Encountered some data : welcome to Tutorials Point!
Line and Offset == (3, 34)
Encountered an end tag : title
Line and Offset == (3, 42)
Encountered some data :

Line and Offset == (4, 0)
Encountered an end tag : head
Line and Offset == (4, 7)
Encountered some data :

Line and Offset == (5, 0)
Encountered a start tag: body
Line and Offset == (5, 6)
Encountered some data :

Line and Offset == (6, 0)
Encountered a start tag: h1
Line and Offset == (6, 4)
Encountered some data : Learn anything !
Line and Offset == (6, 20)
Encountered an end tag : h1
Line and Offset == (6, 25)
Encountered some data :

Line and Offset == (7, 0)
Encountered an end tag : body
Line and Offset == (7, 7)
Encountered some data :

Line and Offset == (8, 0)
Encountered an end tag : html
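The same handlers can be tried without a file on disk by feeding a string directly. The sketch below is a variation on the program above, not a replacement for it: the class name EventRecorder and the shortened html_text sample are assumptions made for illustration. It stores events in a list instead of printing them and skips data that consists only of whitespace, which removes the blank "Encountered some data" entries seen in the output.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical variant of the custom parser that records events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, self.getpos()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, self.getpos()))

    def handle_data(self, data):
        # Ignore the newline-only text that sits between tags
        if data.strip():
            self.events.append(("data", data, self.getpos()))


# Shortened, made-up version of the sample document
html_text = "<html>\n<body>\n<h1>Learn anything !</h1>\n</body>\n</html>"

recorder = EventRecorder()
recorder.feed(html_text)
recorder.close()  # flush any text still buffered at the end of the input

for event in recorder.events:
    print(event)
```

Collecting events in a list makes the parser easier to check in automated tests, since the result can be compared against an expected sequence.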

Updated on: 12-1-2021
