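How do you parse an HTML page with Python to extract HTML tables?
Problem
You need to extract HTML tables from a web page.
Introduction
The Internet and the World Wide Web (WWW) are among the most important sources of information today. There is so much information that it is hard to choose from the many options. Most of it can be retrieved over HTTP.
We can also perform these operations programmatically, to retrieve and process information automatically.
Python lets us do this with the HTTP clients in its standard library, but the requests module makes fetching web pages much easier.
In this article, we will learn how to parse an HTML page and extract the HTML tables embedded in it.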
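How to do it..
1. We will use the requests, pandas, beautifulsoup4, and tabulate packages. Install any that are missing from your system. If you are not sure, check with pip freeze. Note that pandas.read_html also needs an HTML parser backend, such as lxml (or html5lib together with beautifulsoup4).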
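As an alternative to reading the pip freeze output by eye, the check can be done from Python itself. The following is a minimal sketch that uses the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_versions and the REQUIRED list are illustrative and not part of the original recipe.

```python
from importlib import metadata

# Distribution names as they appear on PyPI (assumed from step 1 of this article)
REQUIRED = ["requests", "pandas", "beautifulsoup4", "tabulate"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED - run: pip install ' + name}")
```

Any package reported as missing can then be installed with pip before continuing.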
    import requests
    import pandas as pd
    from tabulate import tabulate

2. We will use https://tutorialspoint.tw/python/python_basic_operators.htm as the page to parse, and print out all the HTML tables embedded in it.

    # Set the site URL
    site_url = "https://tutorialspoint.tw/python/python_basic_operators.htm"
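3. We will make a request to the server and look at the response.

    # Make a request to the server
    response = requests.get(site_url)

    # Check the response
    print(f"*** The response for {site_url} is {response.status_code}")

4. A response code of 200 means the server answered the request successfully. Next, we inspect the request headers, the response headers, and the first 100 characters of the text returned by the server.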
    # Check the request headers
    print(f"*** Printing the request headers - \n {response.request.headers} ")

    # Check the response headers
    print(f"*** Printing the response headers - \n {response.headers} ")

    # Check the content of the results
    print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

Output

    *** Printing the request headers -
     {'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
    *** Printing the response headers -
     {'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '213246', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 20 Oct 2020 09:45:18 GMT', 'Expires': 'Thu, 19 Nov 2020 09:45:18 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
    *** Accessing the first 100/37624 characters -

     <!DOCTYPE html>
    <html lang="en-US">
    <head>
    <title>Python - Basic Operators - Tutorialspoint</title>
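Step 3 prints the status code but carries on whatever its value is. As a minimal sketch of a stricter variant, the helpers below stop with an exception when the server reports an error, using the raise_for_status method of the requests Response object and a request timeout. The function names fetch_html and ensure_success and the 10-second timeout are assumptions made for illustration. The demonstration builds a bare Response object by hand so that it runs without network access.

```python
import requests


def ensure_success(response):
    """Return the body text, or raise requests.HTTPError for a 4xx/5xx status."""
    response.raise_for_status()
    return response.text


def fetch_html(url, timeout=10):
    """Download a page and return its HTML (timeout value is an assumption)."""
    response = requests.get(url, timeout=timeout)
    return ensure_success(response)


# Offline demonstration: a hand-built response that pretends to be a 404
fake_response = requests.Response()
fake_response.status_code = 404
try:
    ensure_success(fake_response)
except requests.HTTPError as err:
    print(f"*** Request failed: {err}")
```

In the recipe itself, fetch_html(site_url) could stand in for the requests.get call when you prefer the script to fail early rather than parse an error page.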
5. We will now parse the HTML with BeautifulSoup.

    # Parse the HTML page
    from bs4 import BeautifulSoup

    tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
    print(f"*** The title of the page is - {tutorialpoints_page.title}")

    # You can extract the page title as a string as well
    print(f"*** The title of the page is - {tutorialpoints_page.title.string}")
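To see the difference between the title tag and its string without downloading anything, here is a small sketch that parses an inline HTML snippet. The SAMPLE_HTML content is made up for illustration; only the BeautifulSoup calls mirror the step above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the downloaded page
SAMPLE_HTML = (
    "<html><head><title>Python - Basic Operators</title></head>"
    "<body><h2>Types of Operator</h2></body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title_tag = soup.title           # the whole <title> element
title_text = soup.title.string   # only the text inside it

print(f"*** Tag:  {title_tag}")
print(f"*** Text: {title_text}")

# A document without a <title> gives None, so check before using .string
no_title = BeautifulSoup("<p>no title here</p>", "html.parser").title
print(f"*** Missing title: {no_title}")
```

Because soup.title is None when the page has no title element, guarding that case avoids an AttributeError on .string.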
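6. Most tables are introduced by a heading defined in an h2, h3, h4, h5, or h6 tag. We first locate such a heading and then pick up the HTML tables that follow it. For this logic we use find and find_next_siblings, as shown below. Note that find returns only the first matching heading, and find_next_siblings then walks every later element at the same level, so all the tables after that first heading are collected.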
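    from io import StringIO

    # Print all the h2 elements
    print(f"{tutorialpoints_page.find_all('h2')}")

    # Find the first heading tag (h2 to h6) on the page
    tags = tutorialpoints_page.find(lambda elm: elm.name in ("h2", "h3", "h4", "h5", "h6"))

    # Walk the siblings after that heading and print every table found
    for sibling in tags.find_next_siblings():
        if sibling.name == "table":
            my_table = sibling
            # read_html returns a list of DataFrames; StringIO is used because
            # newer pandas versions deprecate passing a literal HTML string
            df = pd.read_html(StringIO(str(my_table)))
            print(tabulate(df[0], headers='keys', tablefmt='psql'))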
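The step above starts from the first heading only. If you want to know which heading each table belongs to, one possible variation is sketched below on an inline snippet: for every heading it looks at the following siblings and stops at the next heading. The names tables_by_heading, HEADING_TAGS, and SAMPLE_HTML are hypothetical, and the rows are returned as plain lists rather than DataFrames to keep the sketch free of pandas.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ["h2", "h3", "h4", "h5", "h6"]

# Hypothetical page fragment used only for this illustration
SAMPLE_HTML = """
<div>
  <h2>Arithmetic</h2>
  <p>Intro text.</p>
  <table>
    <tr><th>Operator</th><th>Description</th></tr>
    <tr><td>+</td><td>Addition</td></tr>
  </table>
  <h2>No table here</h2>
  <h3>Comparison</h3>
  <table>
    <tr><th>Operator</th><th>Description</th></tr>
    <tr><td>==</td><td>Equal</td></tr>
  </table>
</div>
"""


def tables_by_heading(html):
    """Map each heading's text to the rows of the first table in its section."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for heading in soup.find_all(HEADING_TAGS):
        for sibling in heading.find_next_siblings():
            if sibling.name in HEADING_TAGS:
                break  # the next section starts here, so this heading has no table
            if sibling.name == "table":
                rows = [
                    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                    for row in sibling.find_all("tr")
                ]
                result[heading.get_text(strip=True)] = rows
                break
    return result


for title, rows in tables_by_heading(SAMPLE_HTML).items():
    print(f"*** {title}")
    for row in rows:
        print("   ", row)
```

Headings that are not followed by a table before the next heading, such as "No table here" in the snippet, are simply left out of the result.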
Complete code
7. Now let's put everything together.
    # STEP 1: Download the required page
    from io import StringIO

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Set the site URL
    site_url = "https://tutorialspoint.tw/python/python_basic_operators.htm"

    # Make a request to the server
    response = requests.get(site_url)

    # Check the response
    print(f"*** The response for {site_url} is {response.status_code}")

    # Check the request headers
    print(f"*** Printing the request headers - \n {response.request.headers} ")

    # Check the response headers
    print(f"*** Printing the response headers - \n {response.headers} ")

    # Check the content of the results
    print(f"*** Accessing the first 100/{len(response.text)} characters - \n\n {response.text[:100]} ")

    # Parse the HTML page
    tutorialpoints_page = BeautifulSoup(response.text, 'html.parser')
    print(f"*** The title of the page is - {tutorialpoints_page.title}")

    # You can extract the page title as a string as well
    print(f"*** The title of the page is - {tutorialpoints_page.title.string}")

    # Print all the h2 elements
    print(f"{tutorialpoints_page.find_all('h2')}")

    # Find the first heading tag (h2 to h6) on the page
    tags = tutorialpoints_page.find(lambda elm: elm.name in ("h2", "h3", "h4", "h5", "h6"))

    # Walk the siblings after that heading and print every table found
    for sibling in tags.find_next_siblings():
        if sibling.name == "table":
            my_table = sibling
            # StringIO avoids the deprecation of literal HTML strings in newer pandas
            df = pd.read_html(StringIO(str(my_table)))
            print(df)

Output

    *** The response for https://tutorialspoint.tw/python/python_basic_operators.htm is 200
    *** Printing the request headers -
     {'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
    *** Printing the response headers -
     {'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Age': '558841', 'Cache-Control': 'max-age=2592000', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Sat, 24 Oct 2020 09:45:13 GMT', 'Expires': 'Mon, 23 Nov 2020 09:45:13 GMT', 'Last-Modified': 'Sat, 17 Oct 2020 22:31:13 GMT', 'Server': 'ECS (meb/A77C)', 'Strict-Transport-Security': 'max-age=63072000; includeSubdomains', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8863'}
    *** Accessing the first 100/37624 characters -

     <!DOCTYPE html>
    <html lang="en-US">
    <head>
    <title>Python - Basic Operators - Tutorialspoint</title>
    *** The title of the page is - <title>Python - Basic Operators - Tutorialspoint</title>
    *** The title of the page is - Python - Basic Operators - Tutorialspoint
    [<h2>Types of Operator</h2>, <h2>Python Arithmetic Operators</h2>, <h2>Python Comparison Operators</h2>, <h2>Python Assignment Operators</h2>, <h2>Python Bitwise Operators</h2>, <h2>Python Logical Operators</h2>, <h2>Python Membership Operators</h2>, <h2>Python Identity Operators</h2>, <h2>Python Operators Precedence</h2>]
    [  Operator Description  \
    0  + Addition Adds values on either side of the operator.
    1  - Subtraction Subtracts right hand operand from left hand op...
    2  * Multiplication Multiplies values on either side of the operator
    3  / Division Divides left hand operand by right hand operand
    4  % Modulus Divides left hand operand by right hand operan...
    5  ** Exponent Performs exponential (power) calculation on op...
    6  // Floor Division - The division of operands wher...
       Example
    0  a + b = 30
    1  a – b = -10
    2  a * b = 200
    3  b / a = 2
    4  b % a = 0
    5  a**b =10 to the power 20
    6  9//2 = 4 and 9.0//2.0 = 4.0, -11//3 = -4, -11....  ]
    [  Operator Description  \
    0  == If the values of two operands are equal, then ...
    1  != If values of two operands are not equal, then ...
    2  <> If values of two operands are not equal, then ...
    3  > If the value of left operand is greater than t...
    4  < If the value of left operand is less than the ...
    5  >= If the value of left operand is greater than o...
    6  <= If the value of left operand is less than or e...

       Example
    0  (a == b) is not true.
    1  (a != b) is true.
    2  (a <> b) is true. This is similar to != operator.
    3  (a > b) is not true.
    4  (a < b) is true.
    5  (a >= b) is not true.
    6  (a <= b) is true.  ]
    [  Operator Description  \
    0  = Assigns values from right side operands to lef...
    1  += Add AND It adds right operand to the left operand and ...
    2  -= Subtract AND It subtracts right operand from the left opera...
    3  *= Multiply AND It multiplies right operand with the left oper...
    4  /= Divide AND It divides left operand with the right operand...
    5  %= Modulus AND It takes modulus using two operands and assign...
    6  **= Exponent AND Performs exponential (power) calculation on op...
    7  //= Floor Division It performs floor division on operators and as...

       Example
    0  c = a + b assigns value of a + b into c
    1  c += a is equivalent to c = c + a
    2  c -= a is equivalent to c = c - a
    3  c *= a is equivalent to c = c * a
    4  c /= a is equivalent to c = c / a
    5  c %= a is equivalent to c = c % a
    6  c **= a is equivalent to c = c ** a
    7  c //= a is equivalent to c = c // a  ]
    [  Operator  \
    0  & Binary AND
    1  | Binary OR
    2  ^ Binary XOR
    3  ~ Binary Ones Complement
    4  << Binary Left Shift
    5  >> Binary Right Shift

       Description  \
    0  Operator copies a bit to the result if it exis...
    1  It copies a bit if it exists in either operand.
    2  It copies the bit if it is set in one operand ...
    3  It is unary and has the effect of 'flipping' b...
    4  The left operands value is moved left by the n...
    5  The left operands value is moved right by the ...

       Example
    0  (a & b) (means 0000 1100)
    1  (a | b) = 61 (means 0011 1101)
    2  (a ^ b) = 49 (means 0011 0001)
    3  (~a ) = -61 (means 1100 0011 in 2's complement...
    4  a << 2 = 240 (means 1111 0000)
    5  a >> 2 = 15 (means 0000 1111)  ]
    [  Operator Description  \
    0  and Logical AND If both the operands are true then condition b...
    1  or Logical OR If any of the two operands are non-zero then c...
    2  not Logical NOT Used to reverse the logical state of its operand.

       Example
    0  (a and b) is true.
    1  (a or b) is true.
    2  Not(a and b) is false.  ]
    [  Operator Description  \
    0  in Evaluates to true if it finds a variable in th...
    1  not in Evaluates to true if it does not finds a varia...

       Example
    0  x in y, here in results in a 1 if x is a membe...
    1  x not in y, here not in results in a 1 if x is...  ]
    [  Operator Description  \
    0  is Evaluates to true if the variables on either s...
    1  is not Evaluates to false if the variables on either ...

       Example
    0  x is y, here is results in 1 if id(x) equals i...
    1  x is not y, here is not results in 1 if id(x) ...  ]
    [   Sr.No.  Operator & Description
    0   1   ** Exponentiation (raise to the power)
    1   2   ~ + - Complement, unary plus and minus (method...
    2   3   * / % // Multiply, divide, modulo and floor di...
    3   4   + - Addition and subtraction
    4   5   >> << Right and left bitwise shift
    5   6   & Bitwise 'AND'
    6   7   ^ | Bitwise exclusive `OR' and regular `OR'
    7   8   <= < > >= Comparison operators
    8   9   <> == != Equality operators
    9   10  = %= /= //= -= += *= **= Assignment operators
    10  11  is is not Identity operators
    11  12  in not in]
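Each print(df) above shows a list in square brackets because pandas.read_html always returns a list of DataFrames, even for a single table. As a hedged illustration of that behaviour, the sketch below hands a whole HTML fragment with two tables to read_html in one call. It assumes a parser backend such as lxml (or html5lib with beautifulsoup4) is installed, and SAMPLE_HTML is made-up data, not the tutorial page.

```python
from io import StringIO

import pandas as pd

# Hypothetical fragment with two tables, used only for this illustration
SAMPLE_HTML = """
<h2>Arithmetic</h2>
<table>
  <tr><th>Operator</th><th>Example</th></tr>
  <tr><td>+</td><td>a + b = 30</td></tr>
  <tr><td>*</td><td>a * b = 200</td></tr>
</table>
<h2>Comparison</h2>
<table>
  <tr><th>Operator</th><th>Example</th></tr>
  <tr><td>==</td><td>(a == b) is not true.</td></tr>
</table>
"""

# One call parses every <table> in the markup and returns a list of DataFrames
tables = pd.read_html(StringIO(SAMPLE_HTML))

print(f"*** Found {len(tables)} tables")
for index, table in enumerate(tables):
    rows, columns = table.shape
    print(f"*** Table {index}: {rows} rows x {columns} columns")
    print(table.to_string(index=False))
```

Reading the whole page this way loses the link between a table and its heading, which is why the recipe walks the siblings with BeautifulSoup first.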