Python Pandas - Ordering and Sorting Categorical Data

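In data analysis we often work with categorical data, especially in columns that contain repeated string values such as country names, genders, or ratings. Categorical data can take only a limited number of distinct values. For example, the values "India" and "Australia" in a country column, or "Male" and "Female" in a gender column, are categorical. These values can also be ordered, which allows them to be sorted logically.

In Pandas, categorical is a data type for variables that take a fixed number of possible values, known as categories. This kind of data is common in statistical analysis. In this tutorial, we will learn how to order and sort categorical data using Pandas.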
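
As a minimal sketch of the idea above, the snippet below converts a string column to the categorical dtype and inspects it. The DataFrame and its "country" column are hypothetical sample data, not part of the tutorial's own examples.

```python
import pandas as pd

# Hypothetical sample data: a column with repeated string values
df = pd.DataFrame({"country": ["India", "Australia", "India", "India", "Australia"]})

# Convert the string column to the categorical dtype
df["country"] = df["country"].astype("category")

# The distinct values are stored once as the categories
categories = list(df["country"].cat.categories)
print("Categories:", categories)

# A categorical is unordered unless an order is requested
is_ordered = df["country"].cat.ordered
print("Ordered:", is_ordered)
```

Because no order was requested, this column supports equality checks and sorting by category, but not min() or max().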

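Ordering Categorical Data

Ordered categorical data in Pandas carries a meaningful order among its categories, which allows operations such as sorting, min(), max(), and comparisons. If you apply min() or max() to unordered categorical data, Pandas raises a TypeError. The .cat accessor provides the as_ordered() method, which converts a categorical into an ordered one using its existing category order.

Example

The following example shows how to create an ordered categorical series with the .cat.as_ordered() method and then find its minimum and maximum values.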

import pandas as pd

# Create a categorical series
s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"]).astype(pd.CategoricalDtype())

# Convert the categorical series into ordered using the .cat.as_ordered() method 
s = s.cat.as_ordered()

# Display the ordered categorical series
print('Ordered Categorical Series:\n',s)

# Perform the minimum and maximum operation on ordered categorical series
print('Minimum value of the categorical series:',s.min())
print('Maximum value of the categorical series:', s.max())

Following is the output of the above code:

Ordered Categorical Series: 
0    a
1    b
2    c
3    a
4    a
5    a
6    b
7    b
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

Minimum value of the categorical series: a
Maximum value of the categorical series: c
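
The TypeError mentioned earlier for unordered data can be reproduced with a short sketch. The variable names here are illustrative only; the snippet assumes a series created with dtype="category" and no explicit order.

```python
import pandas as pd

# An unordered categorical series (no order was requested)
unordered = pd.Series(["a", "b", "c", "a"], dtype="category")

# min() is not defined for unordered categoricals
try:
    unordered.min()
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# After as_ordered(), min() and order comparisons are allowed
ordered = unordered.cat.as_ordered()
smallest = ordered.min()
greater_than_a = (ordered > "a").tolist()
print(smallest, greater_than_a)
```

Catching the exception makes the difference visible without stopping the script; in real code it is usually simpler to check s.cat.ordered first.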

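Reordering Categories

Pandas lets you reorder or reset the categories of categorical data with the .cat.reorder_categories() and .cat.set_categories() methods.

reorder_categories(): Reorders the existing categories as specified in new_categories. The new list must contain exactly the same categories as the old one.
set_categories(): Defines a new set of categories, which may add new categories or remove existing ones. Values whose category is removed become NaN.

Example

The following example shows how to reorder categories with the reorder_categories() and set_categories() methods.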

import pandas as pd

# Create a categorical series (categories are inferred as ['a', 'b', 'c'])
s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

# Reorder categories using reorder_categories
s_reordered = s.cat.reorder_categories(["b", "a", "c"], ordered=True)

print("Reordered Categories:\n", s_reordered)

# Set new categories using set_categories
s_new_categories = s.cat.set_categories(["d", "b", "a", "c"], ordered=True)

print("\nNew Categories Set:\n", s_new_categories)

Following is the output of the above code:

Reordered Categories:
0    b
1    a
2    c
3    a
4    b
dtype: category
Categories (3, object): ['b' < 'a' < 'c']

New Categories Set:
0    b
1    a
2    c
3    a
4    b
dtype: category
Categories (4, object): ['d' < 'b' < 'a' < 'c']
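
The difference between the two methods shows up when the new list does not match the existing categories. The following is a small sketch under that assumption, reusing the same sample values; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

# reorder_categories() requires exactly the existing categories,
# so leaving out "c" is rejected
try:
    s.cat.reorder_categories(["b", "a"], ordered=True)
    reorder_failed = False
except ValueError as exc:
    reorder_failed = True
    print("ValueError:", exc)

# set_categories() accepts a different set; values whose category
# was dropped ("c" here) become NaN
s_subset = s.cat.set_categories(["b", "a"], ordered=True)
print(s_subset)

missing_count = int(s_subset.isna().sum())
print("Missing values:", missing_count)
```

In short, reorder_categories() is the safer choice when only the order should change, because it refuses to drop or add categories silently.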

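Sorting Categorical Data

Sorting categorical data arranges the values according to the defined order of the categories rather than the values themselves. For example, if the categories are defined in the order ["c", "a", "b"], sorting places the values in that order. If you do not specify the categories explicitly, Pandas infers them in sorted (lexical) order when the values are sortable, so the result matches ordinary alphabetical or numerical sorting.

Example

The following example demonstrates the sorting behavior of unordered and ordered categorical data in Pandas.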

import pandas as pd

# Create a categorical series without any specific order
s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"], dtype="category")

# Sort the categorical series without any predefined order (lexical sorting)
print("Lexical Sorting:\n", s.sort_values())

# Define a custom order for the categories
s = s.cat.set_categories(['c', 'a', 'b'], ordered=True)

# Sort the categorical series with the defined order
print("\nSorted with Defined Category Order:\n", s.sort_values())

Following is the output of the above code:

Lexical Sorting:
0    a
3    a
4    a
5    a
1    b
6    b
7    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Sorted with Defined Category Order:
2    c
0    a
3    a
4    a
5    a
1    b
6    b
7    b
dtype: category
Categories (3, object): ['c' < 'a' < 'b']
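
A logical (non-alphabetical) order is where category-based sorting is most useful. The sketch below assumes a hypothetical rating scale of "low", "medium", and "high"; these labels and variable names are illustrative and not part of the example above.

```python
import pandas as pd

# Hypothetical rating scale with a logical order
rating_dtype = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
ratings = pd.Series(["high", "low", "medium", "low"], dtype=rating_dtype)

# Sorting follows the category order, not the alphabet
ascending = ratings.sort_values().tolist()
print(ascending)   # ['low', 'low', 'medium', 'high']

# ascending=False reverses the category order
descending = ratings.sort_values(ascending=False).tolist()
print(descending)  # ['high', 'medium', 'low', 'low']

# Each value is stored as an integer code: its position in the categories
codes = ratings.cat.codes.tolist()
print(codes)       # [2, 0, 1, 0]
```

An alphabetical sort of the same strings would place "high" before "low", which is why the order is declared in the dtype.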

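Multi-Column Sorting with Categorical Data

If a DataFrame is sorted by several columns and some of them are categorical, the categorical columns are sorted together with the others, and their order follows the defined categories.

Example

In this example, a DataFrame is created with an ordered categorical column "A" and an integer column "B". The DataFrame is sorted first by column "A" according to its category order, and then by column "B".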

import pandas as pd

# Create a DataFrame with an ordered categorical column "A" and an integer column "B"
dfs = pd.DataFrame({
    "A": pd.Categorical(["X", "X", "Y", "Y", "X", "Z", "Z", "X"], categories=["Y", "Z", "X"], ordered=True),
    "B": [1, 2, 1, 2, 2, 1, 2, 1]
})

# Sort by multiple columns
sorted_dfs = dfs.sort_values(by=["A", "B"])

print("Sorted DataFrame:\n", sorted_dfs)

Following is the output of the above code:

Sorted DataFrame:
   A  B
2  Y  1
3  Y  2
5  Z  1
6  Z  2
0  X  1
7  X  1
1  X  2
4  X  2
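
Each sort column can also have its own direction. As a sketch, assuming the same sample data, the snippet below keeps column "A" in category order and sorts column "B" in descending order by passing a list to the ascending parameter of sort_values().

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    "A": pd.Categorical(["X", "X", "Y", "Y", "X", "Z", "Z", "X"],
                        categories=["Y", "Z", "X"], ordered=True),
    "B": [1, 2, 1, 2, 2, 1, 2, 1],
})

# "A" ascending in category order (Y < Z < X), "B" descending within each "A"
result = df.sort_values(by=["A", "B"], ascending=[True, False])
print(result)

col_a = result["A"].tolist()
col_b = result["B"].tolist()
```

The list passed to ascending must have one entry per column in by.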
