HTML Support in Python
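Python can process HTML through the HTMLParser class in the html.parser module. The parser reports each tag it encounters, together with the tag's attributes and its position in the input. It also provides handler methods for identifying and extracting the text data in an HTML file.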
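As a minimal sketch of the tag and attribute detection described above, the following parser collects the href value of every anchor tag it sees. The class name LinkCollector and the sample markup are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical example class: gathers href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Illustrative markup, not taken from the article
sample = '<p>Read the <a href="/intro">intro</a> and the <a href="/docs">docs</a>.</p>'

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Storing the values on the instance rather than printing them keeps the parser reusable: the caller decides what to do with the collected links.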
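The example below shows how to subclass HTMLParser to build a custom parser that handles only the tags and data for which the class defines handlers. Here we handle start tags, end tags, and data.
The following is the HTML that the custom parser processes.
Example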
<html>
<head>
<title>welcome to Tutorials Point!</title>
</head>
<body>
<h1>Learn anything !</h1>
</body>
</html>
The following program parses the file above and prints the results produced by the custom parser.
Example
from html.parser import HTMLParser


class Custom_Parser(HTMLParser):
    # Called for each opening tag, such as <title>
    def handle_starttag(self, tag, attrs):
        print("Line and Offset ==", self.getpos())
        print("Encountered a start tag:", tag)

    # Called for each closing tag, such as </title>
    def handle_endtag(self, tag):
        print("Line and Offset ==", self.getpos())
        print("Encountered an end tag :", tag)

    # Called for text between tags, including whitespace-only runs
    def handle_data(self, data):
        print("Line and Offset ==", self.getpos())
        print("Encountered some data :", data)


parser = Custom_Parser()
# Raw string so that "\t" in the Windows path is not read as a tab character
with open(r"E:\test.html", "r") as stream:
    parser.feed(stream.read())
Output
Running the above code gives the following result:
Line and Offset == (1, 0)
Encountered a start tag: html
Line and Offset == (1, 6)
Encountered some data :
Line and Offset == (2, 0)
Encountered a start tag: head
Line and Offset == (2, 6)
Encountered some data :
Line and Offset == (3, 0)
Encountered a start tag: title
Line and Offset == (3, 7)
Encountered some data : welcome to Tutorials Point!
Line and Offset == (3, 34)
Encountered an end tag : title
Line and Offset == (3, 42)
Encountered some data :
Line and Offset == (4, 0)
Encountered an end tag : head
Line and Offset == (4, 7)
Encountered some data :
Line and Offset == (5, 0)
Encountered a start tag: body
Line and Offset == (5, 6)
Encountered some data :
Line and Offset == (6, 0)
Encountered a start tag: h1
Line and Offset == (6, 4)
Encountered some data : Learn anything !
Line and Offset == (6, 20)
Encountered an end tag : h1
Line and Offset == (6, 25)
Encountered some data :
Line and Offset == (7, 0)
Encountered an end tag : body
Line and Offset == (7, 7)
Encountered some data :
Line and Offset == (8, 0)
Encountered an end tag : html
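Many of the data events in this output carry only the line break between two tags. The sketch below is one possible variant, not part of the original program: it parses the same HTML from an in-memory string, skips whitespace-only data, and records events in a list instead of printing them. The names EventRecorder and SAMPLE_HTML are hypothetical.

```python
from html.parser import HTMLParser

# Same markup as the example file, held in a string so no file is needed
SAMPLE_HTML = """<html>
<head>
<title>welcome to Tutorials Point!</title>
</head>
<body>
<h1>Learn anything !</h1>
</body>
</html>"""


class EventRecorder(HTMLParser):
    """Hypothetical variant: stores (kind, value, position) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, self.getpos()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, self.getpos()))

    def handle_data(self, data):
        # Ignore runs that contain only whitespace, such as line breaks
        if data.strip():
            self.events.append(("data", data, self.getpos()))


recorder = EventRecorder()
recorder.feed(SAMPLE_HTML)
recorder.close()  # flush any text the parser is still buffering
for event in recorder.events:
    print(event)
```

Calling close() after the last feed() lets the parser process any trailing text it was holding back while waiting for more input.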