How to Standardize Data in a Pandas DataFrame?

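Standardization, a form of feature scaling, is an important preprocessing step in data analysis. It rescales features measured in different units to a common scale, typically zero mean and unit standard deviation, so that they can be analyzed and compared fairly. Pandas, a popular Python library, makes this straightforward.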

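A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure designed to simplify data manipulation. Its intuitive syntax and flexible features have made it a go-to structure for data practitioners worldwide. Let's look at the methods we can use to standardize the data in a DataFrame.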

Algorithm

In this article, we focus on the following methods for standardizing data in a Pandas DataFrame:

a. Using sklearn.preprocessing.StandardScaler

b. Using the pandas.DataFrame.apply method with a z-score function

c. Using the pandas.DataFrame.subtract and pandas.DataFrame.divide methods

d. Using the pandas.DataFrame.sub and pandas.DataFrame.div methods

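Syntax

This article relies on the pandas library, which provides many functions for working with DataFrames, plus scikit-learn for StandardScaler. Here is a brief overview of the syntax for each method: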

StandardScaler

scaler = StandardScaler()

`StandardScaler` is a class in the `sklearn.preprocessing` module that standardizes features by removing the mean and scaling to unit variance. First, create an instance of the `StandardScaler` class.
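
For reference, "removing the mean and scaling to unit variance" corresponds to the usual z-score formula. The symbols below ($x_i$, $\mu$, $\sigma$, $n$) are notation introduced here for illustration. Note that StandardScaler divides by the population standard deviation (a divisor of $n$, not $n - 1$):

```latex
z_i = \frac{x_i - \mu}{\sigma}, \qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}
```

For the sample values 1 to 5, this gives $\mu = 3$ and $\sigma = \sqrt{2} \approx 1.414214$, so the first value maps to $(1 - 3)/\sqrt{2} \approx -1.414214$.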

fit_transform()

scaler.fit_transform(X)

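The `fit_transform()` method computes the mean and standard deviation of the input data `X` and returns the standardized values. `X` must be two-dimensional, with one row per sample and one column per feature.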

apply

df.apply(func, axis=0)

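`apply()` is a Pandas DataFrame method that applies a function along the specified axis. `func` is the function to apply, and `axis` selects the direction: 0 passes each column to the function, 1 passes each row.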
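
To make the `axis` argument concrete, here is a minimal sketch. The DataFrame `demo` and its columns `x` and `y` are made up for illustration and are not part of the original examples.

```python
import pandas as pd

# Hypothetical two-column DataFrame, used only for illustration
demo = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0: the function receives each column as a Series
col_ranges = demo.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series
row_sums = demo.apply(lambda row: row.sum(), axis=1)

print(col_ranges)  # one value per column: x -> 2, y -> 20
print(row_sums)    # one value per row: 11, 22, 33
```

Standardization is a per-column operation, so the default `axis=0` is the one that matters here.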

subtract and divide

df.subtract(df.mean()).divide(df.std())

This expression standardizes a Pandas DataFrame by subtracting each column's mean (`df.mean()`) and dividing by each column's standard deviation (`df.std()`).

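sub and div

df.sub(df.mean()).div(df.std())

`sub()` and `div()` perform the same element-wise subtraction and division under shorter names; in pandas, `subtract()` and `divide()` are aliases of them, so this expression behaves identically to the previous one. It likewise subtracts each column's mean and divides by each column's standard deviation.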
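
As a rough sketch of how these expressions behave on a DataFrame with more than one column, consider the following. The DataFrame `df_demo` and its column names are hypothetical and chosen only to show two features on different scales.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two features on very different scales
df_demo = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 60.0, 65.0, 80.0, 95.0],
})

# df_demo.mean() and df_demo.std() are Series indexed by column name,
# so pandas aligns them with the matching columns automatically
z_short = df_demo.sub(df_demo.mean()).div(df_demo.std())
z_long = df_demo.subtract(df_demo.mean()).divide(df_demo.std())

# Both spellings produce the same result
print(z_short.equals(z_long))

# Each standardized column has mean ~0 and sample standard deviation ~1
print(np.allclose(z_short.mean(), 0.0), np.allclose(z_short.std(), 1.0))
```

Because the alignment happens by column label, no explicit loop over columns is needed.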

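Examples

Using sklearn.preprocessing.StandardScaler

In the example below, we will:

1. Import the necessary libraries: StandardScaler from sklearn, pandas, and numpy.

2. Create a sample DataFrame 'df' with a single column 'A' containing the values 1 to 5.

3. Instantiate a StandardScaler object 'scaler' and use its fit_transform() method to standardize column 'A'.

4. Print the updated DataFrame, whose column 'A' now holds the standardized values.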

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Construct a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})

# Initialize a scaler
scaler = StandardScaler()

# Fit and transform the data
df['A'] = scaler.fit_transform(np.array(df['A']).reshape(-1, 1))

print(df)

Output

          A
0 -1.414214
1 -0.707107
2  0.000000
3  0.707107
4  1.414214

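Using the pandas.DataFrame.apply method and the z-score

In the example below, we will:

1. Import the pandas library and create a sample DataFrame 'df' with a single column 'A' containing the values 1 to 5.

2. Define a function 'standardize' that takes a column and returns its standardized values, obtained by subtracting the mean and dividing by the standard deviation.

3. Use the apply() method to apply the 'standardize' function to column 'A'.

4. Print the updated DataFrame, whose column 'A' now holds the standardized values.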

import pandas as pd

# Construct a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})

def standardize(column):
    return (column - column.mean()) / column.std()

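# Standardize column 'A' with DataFrame.apply. Selecting df[['A']] keeps a
# DataFrame, so 'standardize' receives the whole column as a Series;
# df['A'].apply(...) would call it on each scalar value instead and fail.
df[['A']] = df[['A']].apply(standardize)

print(df)

Output

          A
0 -1.264911
1 -0.632456
2  0.000000
3  0.632456
4  1.264911

These values differ from the StandardScaler output because the pandas std() method defaults to the sample standard deviation (ddof=1), whereas StandardScaler uses the population standard deviation (ddof=0).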

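Using the pandas.DataFrame.subtract and pandas.DataFrame.divide methods

In the example below, we will:

1. Import the pandas library and create a sample DataFrame 'df' with a single column 'A' containing the values 1 to 5.

2. Compute the mean and standard deviation of column 'A' with the mean() and std() methods.

3. Use the subtract() and divide() methods to standardize column 'A' by subtracting the mean and dividing by the standard deviation.

4. Print the updated DataFrame, whose column 'A' now holds the standardized values.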

import pandas as pd

# Construct a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})

# Standardize column 'A' using subtract and divide methods
df['A'] = df['A'].subtract(df['A'].mean()).divide(df['A'].std())

print(df)

Output

          A
0 -1.264911
1 -0.632456
2  0.000000
3  0.632456
4  1.264911

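Using the pandas.DataFrame.sub and pandas.DataFrame.div methods

In the example below, we will:

1. Import the pandas library and create a sample DataFrame 'df' with a single column 'A' containing the values 1 to 5.

2. Compute the mean and standard deviation of column 'A' with the mean() and std() methods.

3. Use the sub() and div() methods to standardize column 'A' by subtracting the mean and dividing by the standard deviation.

4. Print the updated DataFrame, whose column 'A' now holds the standardized values.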

import pandas as pd

# Construct a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5]
})

# Standardize column 'A' using sub and div methods
df['A'] = df['A'].sub(df['A'].mean()).div(df['A'].std())

print(df)

Output

          A
0 -1.264911
1 -0.632456
2  0.000000
3  0.632456
4  1.264911
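
The StandardScaler output (±1.414214 at the extremes) and the sub()/div() output (±1.264911) differ only in which standard deviation is used as the divisor. The following is a minimal sketch of that difference using pandas alone; the variable names are illustrative.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5], name="A")

# pandas default: sample standard deviation (ddof=1)
z_sample = (values - values.mean()) / values.std()

# Population standard deviation (ddof=0), the convention StandardScaler uses
z_population = (values - values.mean()) / values.std(ddof=0)

print(z_sample.round(6).tolist())      # [-1.264911, -0.632456, 0.0, 0.632456, 1.264911]
print(z_population.round(6).tolist())  # [-1.414214, -0.707107, 0.0, 0.707107, 1.414214]
```

Passing `ddof=0` to `std()` is therefore one way to make the pandas-based methods match StandardScaler. On large datasets the two conventions give nearly identical results.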

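Conclusion

In summary, data standardization plays a key role in preprocessing because many machine learning algorithms are sensitive to the scale of their input features. The right scaling method depends on the algorithm and on the nature of the data. As a rule of thumb, z-score standardization suits data that is roughly normally distributed, while min-max normalization is a common choice when the distribution is unknown or non-normal. Either way, make sure you understand the data before choosing a scaling method. Knowing how these methods work and how to implement them in Python gives you a solid foundation for further data analysis.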
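
Min-max normalization is mentioned above but not demonstrated. As a rough sketch, assuming the same sample values 1 to 5, it can be written with plain pandas arithmetic; the names `df_demo` and `df_minmax` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 3, 4, 5]})

# Min-max normalization: rescale each column to the [0, 1] range
df_minmax = (df_demo - df_demo.min()) / (df_demo.max() - df_demo.min())

print(df_minmax["A"].tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

A column whose values are all identical has a zero range, so this expression would produce NaN for it and such columns need separate handling.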

Tushar Sharma

Updated on: 28 August 2023
