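Python Pandas - Working with HTML Data

The Pandas library provides extensive functionality for working with data in a variety of formats. One such format is HTML (HyperText Markup Language), the standard markup language for structuring web page content. HTML files often contain tabular data, which Pandas can extract and analyze.

An HTML table is a structured format for presenting tabular data as rows and columns within a web page. You can extract this data from HTML with the **pandas.read_html()** function, and write a Pandas DataFrame back out as an HTML table with the **DataFrame.to_html()** method.

In this tutorial, we will learn how to work with HTML data using Pandas, including reading HTML tables and writing Pandas DataFrames to HTML tables.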

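Reading HTML Tables from a URL

The **pandas.read_html()** function reads tables from an HTML file, string, or URL. It parses the <table> elements in the HTML and returns a list of **pandas.DataFrame** objects, one for each table found. Note that it relies on an HTML parser library being installed, such as lxml, or BeautifulSoup4 together with html5lib.

Example

The following is a basic example of reading data from a URL with the **pandas.read_html()** function.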

import pandas as pd

# Read tables from a SQL tutorial
url = "https://tutorialspoint.tw/sql/sql-clone-tables.htm"
tables = pd.read_html(url)

# Access the first table from the URL
df = tables[0]

# Display the resultant DataFrame (label and table on separate lines)
print('Output First DataFrame:')
print(df.head())

The following is the output of the above code:

Output First DataFrame:



   ID      NAME  AGE    ADDRESS  SALARY
0   1    Ramesh   32  Ahmedabad  2000.0
1   2    Khilan   25      Delhi  1500.0
2   3   Kaushik   23       Kota  2000.0
3   4  Chaitali   25     Mumbai  6500.0
4   5    Hardik   27     Bhopal  8500.0
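
Because the URL example needs network access, the following minimal sketch shows the same behavior offline: **pandas.read_html()** returns a list holding one DataFrame per <table> element. The HTML document and the names html_doc and tables are made up for illustration, and the code assumes an HTML parser such as lxml is installed.

```python
import pandas as pd
from io import StringIO

# Hypothetical HTML document containing two separate tables
html_doc = """
<html><body>
<table>
   <tr><th>ID</th><th>NAME</th></tr>
   <tr><td>1</td><td>Ramesh</td></tr>
   <tr><td>2</td><td>Khilan</td></tr>
</table>
<table>
   <tr><th>CITY</th><th>COUNT</th></tr>
   <tr><td>Delhi</td><td>1</td></tr>
</table>
</body></html>
"""

# read_html returns a list of DataFrames, one per <table> found
tables = pd.read_html(StringIO(html_doc))
print(type(tables).__name__, len(tables))

# Inspect the shape and column names of each table
for i, table in enumerate(tables):
    print(i, table.shape, list(table.columns))
```

Indexing the list (for example tables[0]) then selects a single table, as in the URL example.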

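Reading HTML Data from a String

HTML data can be read directly from a string by wrapping it in Python's **io.StringIO** class. Since pandas 2.1, passing a literal HTML string directly to **pandas.read_html()** is deprecated, so wrapping the string this way is the recommended approach.

Example

The following example demonstrates how to read an HTML string with StringIO, without saving it to a file.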

import pandas as pd
from io import StringIO

# Create an HTML string
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Read the HTML string
dfs = pd.read_html(StringIO(html_str))
print(dfs[0])

The following is the output of the above code:




  C1 C2 C3
0  a  b  c
1  x  y  z

Example

This is another way to read an HTML string, without using **io.StringIO**. Here, we save the HTML string to a temporary file and then read it with the **pd.read_html()** function.

import pandas as pd

# Create an HTML string
html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# Save to a temporary file and read
with open("temp.html", "w") as f:
    f.write(html_str)

df = pd.read_html("temp.html")[0]
print(df)

The following is the output of the above code:




  C1 C2 C3
0  a  b  c
1  x  y  z
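
The file temp.html written in the example stays in the working directory after the script finishes. As a possible variation, the sketch below uses Python's standard tempfile module so that the file is cleaned up automatically. The file name table.html is an arbitrary choice for illustration, and an HTML parser such as lxml is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

html_str = """
<table>
   <tr><th>C1</th><th>C2</th><th>C3</th></tr>
   <tr><td>a</td><td>b</td><td>c</td></tr>
   <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# The directory and everything in it are removed when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "table.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_str)

    # Read the file while it still exists
    df = pd.read_html(path)[0]

print(df)
```

The DataFrame is fully loaded into memory inside the with block, so it remains usable after the temporary file has been deleted.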

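Handling Multiple Tables from an HTML File

When reading an HTML file that contains multiple tables, we can use the **match** parameter of the **pd.read_html()** function to read only the tables that contain specific text. The parameter accepts a string or a compiled regular expression.

Example

The following example uses the **match** parameter to read a table with specific text from an HTML file that contains multiple tables.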

import pandas as pd

# Read only the tables that contain the text 'Field'
url = "https://tutorialspoint.tw/sql/sql-clone-tables.htm"
tables = pd.read_html(url, match='Field')

# Access the table
df = tables[0]
print(df.head())

The following is the output of the above code:




     Field           Type Null  Key  Default  Extra
1       ID        int(11)   NO  PRI      NaN    NaN
2     NAME    varchar(20)   NO  NaN      NaN    NaN
3      AGE        int(11)   NO  NaN      NaN    NaN
4  ADDRESS       char(25)  YES  NaN      NaN    NaN
5   SALARY  decimal(18,2)  YES  NaN      NaN    NaN
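
To experiment with table selection without network access, the following sketch applies **match** to an in-memory document, and also shows the **attrs** parameter of **pd.read_html()**, which selects tables by their HTML attributes. The document, the id values, and the variable names are invented for illustration, and an HTML parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables, each carrying its own id attribute
html_doc = """
<html><body>
<table id="customers">
   <tr><th>ID</th><th>NAME</th></tr>
   <tr><td>1</td><td>Ramesh</td></tr>
</table>
<table id="schema">
   <tr><th>Field</th><th>Type</th></tr>
   <tr><td>ID</td><td>int(11)</td></tr>
   <tr><td>NAME</td><td>varchar(20)</td></tr>
</table>
</body></html>
"""

# match: keep only tables containing text that matches the pattern
by_text = pd.read_html(StringIO(html_doc), match="varchar")
print(len(by_text), list(by_text[0].columns))

# attrs: keep only tables whose HTML attributes equal the given values
by_id = pd.read_html(StringIO(html_doc), attrs={"id": "customers"})
print(len(by_id), list(by_id[0].columns))
```

Both calls still return a list, so the selected table is accessed by index even when only one table qualifies.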

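Writing a DataFrame to HTML

A Pandas DataFrame object can be converted to an HTML table with the **DataFrame.to_html()** method. If the **buf** parameter is None (the default), the method returns the HTML as a string; otherwise it writes the HTML to the given buffer or file path.

Example

The following example demonstrates how to write a Pandas DataFrame to an HTML table with the **DataFrame.to_html()** method.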

import pandas as pd

# Create a DataFrame
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# Convert the DataFrame to an HTML table
html = df.to_html()

# Display the HTML string
print(html)

The following is the output of the above code:

<table border="1" class="dataframe">
   <thead>
      <tr style="text-align: right;">
         <th></th>
         <th>A</th>
         <th>B</th>
      </tr>
   </thead>
   <tbody>
      <tr>
         <th>0</th>
         <td>1</td>
         <td>2</td>
      </tr>
      <tr>
         <th>1</th>
         <td>3</td>
         <td>4</td>
      </tr>
   </tbody>
</table>
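
As a further sketch, the code below passes a file path as **buf** so that the HTML is written to disk, uses index=False to leave out the row labels, and reads the file back with **pd.read_html()** to check the round trip. The file name frame.html and the use of a temporary directory are assumptions made for illustration; reading the file back also assumes an HTML parser such as lxml is installed.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# index=False omits the row-label column from the generated table
html = df.to_html(index=False)
print(html)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "frame.html")  # hypothetical file name

    # With buf set, to_html writes to the file and returns None
    result = df.to_html(buf=path, index=False)

    # Read the file back to confirm the data survives the round trip
    round_trip = pd.read_html(path)[0]

print(result)
print(round_trip)
```

Omitting the index keeps the round trip clean; with the default index=True, the row labels would come back as an extra unnamed column.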
