How to extract URLs from HTML links using Python regular expressions?

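URL stands for Uniform Resource Locator; it identifies the location of a resource on the Internet. For example, the following URLs identify the locations of the Google and Microsoft websites: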

https://www.google.com
https://www.microsoft.com

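A URL consists of parts such as a protocol (scheme), a domain name, a port number, and a path. Regular expressions can be used to parse and process URLs. To use regular expressions in Python, import the built-in re module.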

Example

The following example shows a URL and its parts:

URL: https://tutorialspoint.tw/courses
Parsing the URL above yields the hostname and the protocol:
Hostname: tutorialspoint.tw
Protocol: https
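
For comparison only, the breakdown above can be cross-checked with the standard library's urllib.parse.urlsplit, which is not part of the regular-expression approach this article uses. This is a minimal sketch; the variable name parts is just an illustration.

```python
from urllib.parse import urlsplit

# Split the example URL into its components (cross-check, not regex-based)
parts = urlsplit("https://tutorialspoint.tw/courses")

print("Protocol:", parts.scheme)    # https
print("Hostname:", parts.hostname)  # tutorialspoint.tw
print("Path:", parts.path)          # /courses
```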

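Regular expressions

In Python, a regular expression is a search pattern used to find matching text in a string.

The re module provides several functions; four commonly used ones are:

search() - finds the first match anywhere in the string.
match() - matches the pattern only at the beginning of the string.
findall() - finds all non-overlapping matches.
sub() - replaces the text that matches the pattern with a new string.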
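
The difference between these four functions is easiest to see side by side. The sketch below applies each of them to the same string; the sample text and the simplified pattern are assumptions chosen for illustration, not a complete URL matcher.

```python
import re

# Sample text and a deliberately simple URL pattern (both illustrative)
text = "Visit https://tutorialspoint.tw or https://www.google.com today"
pattern = r"https?://[\w.-]+"

first = re.search(pattern, text)         # first match anywhere in the string
at_start = re.match(pattern, text)       # None: the text does not start with a URL
all_urls = re.findall(pattern, text)     # every non-overlapping match
masked = re.sub(pattern, "<URL>", text)  # replace each match with a placeholder

print(first.group())  # https://tutorialspoint.tw
print(at_start)       # None
print(all_urls)       # ['https://tutorialspoint.tw', 'https://www.google.com']
print(masked)         # Visit <URL> or <URL> today
```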

To find every occurrence of a pattern, such as a URL, in a piece of text, use re.findall(), a function in the re module.

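Syntax

The following is the syntax of the re.findall() function in Python:

re.findall(regex, string)

It returns all non-overlapping matches of the pattern in the string as a list of strings. If the pattern contains capture groups, the list holds the group contents instead (tuples when there is more than one group).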
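
What re.findall() returns depends on the capture groups in the pattern: whole matches with no group, the group's text with one group, and tuples with several groups. The following sketch shows the three cases; the sample string and patterns are illustrative assumptions.

```python
import re

text = "https://www.google.com and https://www.microsoft.com"

# No capture group: each item is the whole match
whole = re.findall(r"https://[\w.]+", text)
print(whole)   # ['https://www.google.com', 'https://www.microsoft.com']

# One capture group: each item is that group's text
hosts = re.findall(r"https://([\w.]+)", text)
print(hosts)   # ['www.google.com', 'www.microsoft.com']

# Two capture groups: each item is a tuple of the groups
pairs = re.findall(r"(\w+)://([\w.]+)", text)
print(pairs)   # [('https', 'www.google.com'), ('https', 'www.microsoft.com')]
```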

Example

To extract URLs, you can use the following code:

import re

text = '<p>Hello World: </p><a href="https://tutorialspoint.tw">More Courses</a><a href="https://tutorialspoint.tw/market/index.asp">Even More Courses</a>'
# Raw string so the backslashes reach the regex engine unchanged
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
print("Original string: ", text)
print("Urls:", urls)

Output

Running the program above produces the following output:

Original string:  <p>Hello World: </p><a href="https://tutorialspoint.tw">More Courses</a><a href="https://tutorialspoint.tw/market/index.asp">Even More Courses</a>
Urls: ['https://tutorialspoint.tw', 'https://tutorialspoint.tw/market/index.asp']
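
The pattern above finds anything that looks like a URL anywhere in the text. When only link targets are wanted, one option is to capture the value of the href attribute inside <a> tags. The sketch below assumes quoted attribute values and simple markup; the names html, href_pattern, and links are illustrative, and a regular expression like this is not a full HTML parser.

```python
import re

html = ('<p>Hello World: </p>'
        '<a href="https://tutorialspoint.tw">More Courses</a>'
        '<a href="https://tutorialspoint.tw/market/index.asp">Even More Courses</a>')

# Capture the href value of each <a ...> tag, quoted with " or '
href_pattern = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
links = href_pattern.findall(html)

print(links)
# ['https://tutorialspoint.tw', 'https://tutorialspoint.tw/market/index.asp']
```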

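Example

The following program shows how to extract the hostname and protocol from a given URL.

import re

website = 'https://tutorialspoint.tw/'
# Find the protocol: the word characters before "://"
protocol = re.findall(r'(\w+)://', website)
print(protocol)
# Find the hostname; a leading "www." is optional and is not captured
hostname = re.findall(r'://(?:www\.)?([\w\-.]+)', website)
print(hostname)

Output

Running the program above produces the following output:

['https']
['tutorialspoint.tw']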

Example

The following program breaks a general URL that includes a path element into its components.

import re

url = 'https://tutorialspoint.tw/index.html'

# Capture protocol, hostname, file name, and extension; \. matches a literal dot
parts = re.findall(r'(\w+)://([\w\-.]+)/(\w+)\.(\w+)', url)
print(parts)

Output

Running the program above produces the following output:

[('https', 'tutorialspoint.tw', 'index', 'html')]
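
Positional tuples can be hard to read once a pattern has several groups. As a possible variation, the same kind of pattern can use named groups so that each part is retrieved by name. This is a sketch; the group names protocol, host, name, and ext are chosen here for illustration.

```python
import re

url = 'https://tutorialspoint.tw/index.html'

# Same structure as the pattern above, with a name for each capture group
pattern = re.compile(
    r'(?P<protocol>\w+)://(?P<host>[\w\-.]+)/(?P<name>\w+)\.(?P<ext>\w+)'
)

m = pattern.match(url)
# groupdict() maps each group name to the text it captured
info = m.groupdict() if m else {}

print(info)
# {'protocol': 'https', 'host': 'tutorialspoint.tw', 'name': 'index', 'ext': 'html'}
```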

Bhanu Priya

Updated on: October 4, 2023
