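Categorical Data in Machine Learning

What is Categorical Data?

Categorical data in machine learning consists of categories or labels rather than numerical values. The categories can be nominal, meaning they have no inherent order or ranking (for example, color or gender), or ordinal, meaning they have a natural order (for example, education level or income bracket).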
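
The distinction between nominal and ordinal data can be made explicit in code. The following is a minimal sketch using pandas Categorical; the variable names and sample values are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Nominal: the categories have no order
colors = pd.Categorical(['red', 'blue', 'green', 'red'])

# Ordinal: list the categories from lowest to highest and set ordered=True
education = pd.Categorical(
    ['bachelor', 'high school', 'master', 'bachelor'],
    categories=['high school', 'bachelor', 'master'],
    ordered=True,
)

print(colors.ordered)     # False
print(education.ordered)  # True

# Integer codes follow the declared category order
print(education.codes.tolist())  # [1, 0, 2, 1]

# Order comparisons are only meaningful for ordered categoricals
print((education > 'high school').tolist())  # [True, False, True, True]
```

Declaring the order up front keeps later encoding steps from silently falling back to alphabetical order.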

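Categorical data is usually represented with discrete values such as integers or strings, and it is often encoded as one-hot vectors before being used as input to a machine learning model. One-hot encoding creates a binary vector for each category, with a 1 in the position corresponding to that category and a 0 in every other position.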

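Techniques for Handling Categorical Data

Handling categorical data is an important part of machine learning preprocessing because many algorithms require numerical input. Depending on the algorithm and the nature of the categorical data, different encoding techniques can be used, such as label encoding, ordinal encoding, or binary encoding.

The rest of this chapter covers the following techniques for handling categorical data in machine learning, along with their implementations in Python.

One-Hot Encoding
Label Encoding
Frequency Encoding
Target Encoding
Binary Encoding

Let us look at each of these techniques in turn.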

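1. One-Hot Encoding

One-hot encoding is a common technique for handling categorical data in machine learning. It creates a binary vector for each category, where each element indicates the presence or absence of one category. For example, for a categorical variable representing color with the values red, blue, and green, one-hot encoding produces three binary vectors: [1, 0, 0], [0, 1, 0], and [0, 0, 1].

Example

The following example performs one-hot encoding in Python using the Pandas library: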

import pandas as pd

# Creating a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# Performing one-hot encoding
# (dtype=int yields 0/1 columns; recent pandas versions default to True/False)
one_hot_encoded = pd.get_dummies(df['color'], prefix='color', dtype=int)

# Combining the encoded data with the original data
df = pd.concat([df, one_hot_encoded], axis=1)

# Drop the original categorical variable
df = df.drop('color', axis=1)

# Print the encoded data
print(df)

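Output

This creates a one-hot encoded DataFrame with three binary variables ("color_blue", "color_green", and "color_red") that take the value 1 if the corresponding color is present and 0 otherwise. The encoded data (shown in the output below) can be used for machine learning tasks such as classification and regression.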

      color_blue    color_green    color_red
0        0              0              1
1        0              1              0
2        1              0              0
3        0              0              1
4        0              1              0

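One-hot encoding works well for categorical variables with a small, fixed set of categories, but it can be problematic for variables with many categories because it adds one input feature per category.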
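
To see how quickly the feature count grows, consider the sketch below. The 'user_id' column and its 1,000 values are a hypothetical high-cardinality variable invented for illustration.

```python
import pandas as pd

# Hypothetical high-cardinality column: 1,000 distinct user IDs
df = pd.DataFrame({'user_id': [f'user_{i}' for i in range(1000)]})

encoded = pd.get_dummies(df['user_id'], prefix='user_id')

# One-hot encoding adds one column per distinct category
print(df['user_id'].nunique())  # 1000
print(encoded.shape)            # (1000, 1000)
```

A single input column has become 1,000 features, almost all of them zero in any given row.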

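2. Label Encoding

Label encoding is another technique for handling categorical data in machine learning. It assigns a unique integer to each category in a categorical variable. With scikit-learn's LabelEncoder, the integers follow the sorted (alphabetical) order of the category names, not any order implied by their meaning.

For example, suppose we have a categorical variable named "Size" with three categories: "small", "medium", and "large". Sorted alphabetically, these are "large", "medium", and "small", so label encoding assigns them the values 0, 1, and 2 respectively.

Example

The following example performs label encoding in Python using the scikit-learn library: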

from sklearn.preprocessing import LabelEncoder

# create a sample dataset with a categorical variable
data = ['small', 'medium', 'large', 'small', 'large']

# create a label encoder object
label_encoder = LabelEncoder()

# fit and transform the data using the label encoder
encoded_data = label_encoder.fit_transform(data)

# print the encoded data
print(encoded_data)

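This produces the encoded array [2, 1, 0, 2, 0], in which "large", "medium", and "small" are encoded as 0, 1, and 2 respectively. Note that LabelEncoder always assigns codes in sorted (alphabetical) order and has no option for a custom order; it is designed for target labels. To encode "small" < "medium" < "large" as 0, 1, and 2, use scikit-learn's OrdinalEncoder with an explicit categories list instead.

Output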

[2 1 0 2 0]

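Integer codes are useful when the categories have a natural order, as with ordinal categorical variables, provided the codes actually follow that order. For nominal categorical variables, use label encoding with caution because the numeric values can imply an order that does not exist. In those cases, one-hot encoding is the safer choice.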
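
When the order matters, the codes can be pinned to it explicitly. The following is a minimal sketch using scikit-learn's OrdinalEncoder with its categories parameter; the data mirrors the label encoding example, and the variable names are illustrative.

```python
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder expects 2D input: one row per sample, one column per feature
data = [['small'], ['medium'], ['large'], ['small'], ['large']]

# Declare the intended order explicitly, from lowest to highest
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(data)

# The result is a float array by default
print(encoded.ravel().tolist())  # [0.0, 1.0, 2.0, 0.0, 2.0]
```

Here "small" maps to 0, "medium" to 1, and "large" to 2, so the numeric order matches the meaning of the categories.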

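3. Frequency Encoding

Frequency encoding is another technique for handling categorical data in machine learning. It replaces each category in a categorical variable with the frequency (or count) with which that category appears in the dataset. The idea behind frequency encoding is that categories that occur more often may be more important or more informative to the machine learning algorithm.

Example

The following example performs frequency encoding in Python: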

import pandas as pd

# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# calculate the frequency of each category in the categorical variable
freq = df['color'].value_counts(normalize=True)

# replace each category with its frequency
df['color_freq'] = df['color'].map(freq)

# drop the original categorical variable
df = df.drop('color', axis=1)

# print the encoded data
print(df)

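This creates an encoded DataFrame with a single variable ("color_freq") holding the relative frequency of each row's category in the original variable. In this dataset, "red" and "green" each appear twice and "blue" appears once out of five rows, so their frequencies are 0.4, 0.4, and 0.2 respectively.

Output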

      color_freq
0        0.4
1        0.4
2        0.2
3        0.4
4        0.4

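Frequency encoding can be a useful alternative to one-hot encoding or label encoding, especially for high-cardinality categorical variables (that is, variables with a large number of categories). However, it is not always effective: categories with the same frequency become indistinguishable, and its performance depends on the dataset and the machine learning algorithm used.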
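
In practice the frequencies are usually learned from training data and then applied to new data. The sketch below illustrates this, along with the collision issue; the train/new split and the choice of 0.0 as a fallback for unseen categories are assumptions made for illustration.

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})
new = pd.DataFrame({'color': ['red', 'blue', 'yellow']})  # 'yellow' is unseen

# Learn the frequencies from the training data only
freq = train['color'].value_counts(normalize=True)

# Unseen categories map to NaN; 0.0 is an assumed fallback value
new['color_freq'] = new['color'].map(freq).fillna(0.0)
print(new)

# Limitation: 'red' and 'green' have the same frequency, so their codes collide
print(freq['red'] == freq['green'])  # True
```

If two categories with equal frequency behave differently with respect to the target, a frequency-encoded model cannot tell them apart.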

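4. Target Encoding

Target encoding is another technique for handling categorical data in machine learning. It replaces each category in a categorical variable with the mean (or another aggregate) of the target variable (that is, the variable you want to predict) for that category. The idea behind target encoding is that it captures the relationship between the categorical variable and the target variable, which can improve the predictive performance of a machine learning model.

Example

The following example performs target (mean) encoding in Python by combining scikit-learn's LabelEncoder with a Pandas groupby that computes the per-category target mean: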

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# create a sample dataset with a categorical variable and a target variable
data = {'color': ['red', 'green', 'blue', 'red', 'green'],
   'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# create a label encoder object and fit it to the data
label_encoder = LabelEncoder()
label_encoder.fit(df['color'])

# transform the categorical variable using the label encoder
df['color_encoded'] = label_encoder.transform(df['color'])

# create a mean encoder object and fit it to the transformed data
mean_encoder = df.groupby('color_encoded')['target'].mean().to_dict()

# map the mean encoded values to the categorical variable
df['color_encoded'] = df['color_encoded'].map(mean_encoder)

# print the encoded data
print(df)

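In this example, we first create a Pandas DataFrame df containing a categorical variable 'color' and a target variable 'target'. We then create a LabelEncoder object from scikit-learn and fit it to the 'color' column of df.

Next, we transform the categorical variable 'color' by calling the transform method on the label encoder object, and assign the resulting encoded values to a new column 'color_encoded' in df.

Finally, we build the mean encoding by grouping on the 'color_encoded' column and computing the mean of the 'target' column for each group. We convert the result to a dictionary and map it onto the 'color_encoded' column, replacing each integer label with its target mean. The original 'color' column is left unchanged.

Output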

   color     target     color_encoded
0  red        1           0.5
1  green      0           0.5
2  blue       1           1.0
3  red        0           0.5
4  green      1           0.5

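Target encoding can be a powerful technique for improving the predictive performance of machine learning models, especially on datasets with high-cardinality categorical variables. However, because the encoding is computed from the target itself, it can leak target information and overfit. It is important to guard against this with cross-validation (computing the encoding out of fold) and regularization such as smoothing.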
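
One common form of regularization is to shrink each category mean toward the global mean, weighted by how many rows the category has. The following is a minimal sketch of that idea; the smoothing strength m = 2.0 is a hypothetical value chosen for illustration, and in practice the statistics would be computed on training folds only.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green'],
    'target': [1, 0, 1, 0, 1],
})

m = 2.0  # smoothing strength (hypothetical value; tune it in practice)
global_mean = df['target'].mean()

# Per-category row count and target mean
stats = df.groupby('color')['target'].agg(['count', 'mean'])

# Weighted average of the category mean and the global mean;
# categories with few rows are pulled hardest toward the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['color_encoded'] = df['color'].map(smoothed)
print(df)
```

With this data, "blue" appears only once, so its encoding moves from the raw mean of 1.0 to roughly 0.73, while "red" and "green" move from 0.5 to 0.55.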

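5. Binary Encoding

Binary encoding is another technique for encoding categorical variables in machine learning. Each category is first assigned an integer, typically based on its position in an ordered list of all categories, and that integer is then written in binary. Each binary digit becomes its own column, so a column holds one bit of the category's code rather than indicating the presence or absence of a single category.

Example

The following example implements binary encoding in Python using the category_encoders library: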

import pandas as pd
import category_encoders as ce

# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# create a binary encoder object and fit it to the data
binary_encoder = ce.BinaryEncoder(cols=['color'])
binary_encoder.fit(df['color'])

# transform the categorical variable using the binary encoder
encoded_data = binary_encoder.transform(df['color'])

# merge the encoded variable with the original dataframe
df = pd.concat([df, encoded_data], axis=1)

# print the encoded data
print(df)

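In this example, we first create a Pandas DataFrame df containing a categorical variable 'color'. We then create a BinaryEncoder object from the category_encoders library and fit it to the 'color' column of df.

Next, we transform the categorical variable 'color' by calling the transform method on the binary encoder object, and assign the resulting encoded values to a new DataFrame, encoded_data.

Finally, we use the concat method to merge the encoded variable with the original DataFrame df along the column axis (axis=1). The resulting DataFrame contains the original 'color' column as well as the encoded binary columns.

Output

Running the code produces the following output: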

   color    color_0    color_1
0   red       0           1
1   green     1           0
2   blue      1           1
3   red       0           1
4   green     1           0

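Binary encoding needs only about log2(n) columns for n categories, so it is far more compact than one-hot encoding for variables with a moderate or large number of categories. The trade-off is that categories share columns, so unrelated categories can look similar to the model, and the individual columns are harder to interpret.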
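
The mechanics can be reproduced by hand with Pandas alone. The sketch below assumes integer codes starting at 1 in order of first appearance, which is consistent with the output shown above; it is an illustration of the idea, not the library's actual implementation.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# Step 1: integer codes starting at 1, in order of first appearance
codes = pd.factorize(df['color'])[0] + 1  # red=1, green=2, blue=3

# Step 2: number of bits needed to represent the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first
for i in range(n_bits):
    shift = n_bits - 1 - i
    df[f'color_{i}'] = (codes >> shift) & 1

print(df)
```

Three categories fit in two columns here; 1,000 categories would need only ten, compared with 1,000 columns under one-hot encoding.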
