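Finding the Quantile and Decile Rank of a Pandas DataFrame Column

Quantile rank and decile rank are common statistical measures that describe where an observation sits relative to the rest of the dataset. The quantile (percentile) rank expresses an observation's rank as a fraction of the number of observations, and the decile rank assigns each observation to one of ten groups numbered 1 to 10. In this technical blog, we explore how to find the quantile and decile rank of a Pandas DataFrame column in Python.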

Installation and Syntax

pip install pandas

The syntax for finding the quantile and decile rank of a Pandas DataFrame column is as follows:

# For finding quantile rank
df['column_name'].rank(pct=True)

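# For finding decile rank (rank() has no bins argument; pd.qcut does the binning)
# labels=False returns bin indices 0-9, so add 1 to get deciles 1-10
pd.qcut(df['column_name'], q=10, labels=False) + 1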

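Algorithm

Load the data into a Pandas DataFrame.
Select the column whose quantile and decile rank you want to find.
Call the **rank()** method with the pct argument set to True to get the quantile rank of each observation in the column.
Bin the column into **10** groups to get the decile rank of each observation: **pd.qcut()** with q=10 builds the groups from quantiles, while **pd.cut()** with bins=10, used in the examples below, splits the value range into equal-width intervals. Note that rank() has no bins argument and does not accept method='nearest'.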
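
The steps above can be sketched as a small helper. This is a minimal illustration rather than a definitive implementation: the function name add_rank_columns, the column name score, and the sample values are all made up for the sketch, and it assumes the column has enough distinct values for pd.qcut() to form ten bins.

```python
import pandas as pd


def add_rank_columns(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with quantile-rank and decile-rank columns added.

    Hypothetical helper written for this sketch; not part of pandas.
    """
    out = df.copy()
    # Quantile (percentile) rank: rank divided by the number of observations
    out[f"{column}_quantile_rank"] = out[column].rank(pct=True)
    # Decile rank: quantile-based bins 0-9 from pd.qcut, shifted to 1-10
    out[f"{column}_decile_rank"] = pd.qcut(out[column], q=10, labels=False) + 1
    return out


# Steps 1 and 2: load the data and pick the column (sample values are invented)
df = pd.DataFrame({"score": [12, 45, 7, 33, 28, 51, 19, 40, 3, 60]})
result = add_rank_columns(df, "score")
print(result)
```

With ten distinct values, each observation lands in its own decile, so the decile rank is simply ten times the quantile rank.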

Example 1

import pandas as pd

# Create a DataFrame
data = {'A': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]}
df = pd.DataFrame(data)

# Find the quantile rank
df['A_quantile_rank'] = df['A'].rank(pct=True)

print(df)

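Output

    A  A_quantile_rank
0   1              0.1
1   3              0.2
2   5              0.3
3   7              0.4
4   9              0.5
5  11              0.6
6  13              0.7
7  15              0.8
8  17              0.9
9  19              1.0

This example creates a Pandas DataFrame with a column A of 10 integers, then uses the **rank()** method with the pct argument set to True to find the quantile rank of each observation in column A. The result is stored in a new column, **A_quantile_rank**, and the resulting DataFrame is printed. Because the 10 values are distinct, their ranks 1 to 10 divided by the count of 10 give 0.1, 0.2, and so on up to 1.0.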
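
Example 1 has no repeated values, so it does not show how ties affect the quantile rank. The short sketch below, using an invented four-value Series, compares three of the tie-breaking options that rank() accepts through its method argument ('average' is the default; 'min' and 'max' are two alternatives).

```python
import pandas as pd

# Invented sample with one tie: the value 20 appears twice
s = pd.Series([10, 20, 20, 30])

avg_rank = s.rank(pct=True)                # default method='average'
min_rank = s.rank(pct=True, method="min")  # tied values share the lowest rank
max_rank = s.rank(pct=True, method="max")  # tied values share the highest rank

print(pd.DataFrame({"value": s, "average": avg_rank, "min": min_rank, "max": max_rank}))
# average: 0.25, 0.625, 0.625, 1.0
# min:     0.25, 0.5,   0.5,   1.0
# max:     0.25, 0.75,  0.75,  1.0
```

Which option fits depends on how tied observations should be treated in the analysis at hand.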

Example 2

import pandas as pd

# Create a DataFrame
data = {'A': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]}
df = pd.DataFrame(data)

# Find the decile rank: pd.cut splits the range of A into n equal-width bins labeled 1..n
n = 10
df['A_decile_rank'] = pd.cut(df['A'], n, labels=range(1, n+1)).astype(int)

print(df)

Output

    A  A_decile_rank
0   1              1
1   3              2
2   5              3
3   7              4
4   9              5
5  11              6
6  13              7
7  15              8
8  17              9
9  19             10

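This example creates a Pandas DataFrame with a column A of 10 integers, then uses **pd.cut()** to split the range of A into n = 10 equal-width bins labeled 1 to 10 and converts the labels to integers with astype(int). The result is stored in a new column, **A_decile_rank**, and the resulting DataFrame is printed. Because the values here are evenly spaced, each equal-width bin holds exactly one observation, so the bin number coincides with the decile rank. For unevenly distributed data the two differ, and pd.qcut() is the function that builds bins from quantiles.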
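
To make the difference between equal-width and quantile-based bins concrete, the sketch below runs both functions on an invented Series with one very large value. With labels=False, pd.cut() and pd.qcut() return zero-based bin indices, so 1 is added to number the bins from 1 to 10.

```python
import pandas as pd

# Invented, deliberately skewed sample: nine small values and one large one
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width bins over the range 1..100: almost everything falls in bin 1
equal_width = pd.cut(s, bins=10, labels=False) + 1

# Quantile-based bins: each bin holds the same share of observations
equal_count = pd.qcut(s, q=10, labels=False) + 1

print(pd.DataFrame({"value": s, "cut_bin": equal_width, "qcut_decile": equal_count}))
# cut_bin:     1, 1, 1, 1, 1, 1, 1, 1, 1, 10
# qcut_decile: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
```

Only the pd.qcut() column behaves like a decile rank here; the pd.cut() column describes where a value sits within the overall range.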

Example 3

import pandas as pd
import numpy as np

# Create a DataFrame
np.random.seed(42)
data = {'A': np.random.normal(0, 1, 1000), 'B': np.random.normal(5, 2, 1000)}
df = pd.DataFrame(data)

# Find the quantile rank of column A
df['A_quantile_rank'] = df['A'].rank(pct=True)

# Find the decile rank of column B (pd.cut uses 10 equal-width bins over the range of B)
n = 10
df['B_decile_rank'] = pd.cut(df['B'], n, labels=range(1, n+1)).astype(int)

# Print the resulting DataFrame
print(df)

Output

            A         B  A_quantile_rank  B_decile_rank
0    0.496714  7.798711            0.693              8
1   -0.138264  6.849267            0.436              7
2    0.647689  5.119261            0.750              5
3    1.523030  3.706126            0.929              4
4   -0.234153  6.396447            0.405              6
..        ...       ...              ...            ...
995 -0.281100  7.140300            0.384              7
996  1.797687  4.946957            0.960              5
997  0.640843  3.236251            0.746              4
998 -0.571179  4.673866            0.276              5
999  0.572583  3.510195            0.718              4

[1000 rows x 4 columns]

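This example starts from a Pandas DataFrame with two columns, A and B, each holding **1000** randomly generated values. It uses the **rank()** method with the pct argument set to True to find the quantile rank of column A and stores the result in a new column, **A_quantile_rank**. It then uses **pd.cut()** to split the range of column B into 10 equal-width bins labeled 1 to 10 and stores the bin number in a new column, **B_decile_rank**. Finally, it prints the resulting DataFrame. Since B is normally distributed, the equal-width bins hold unequal numbers of rows, so B_decile_rank here reflects position within the value range rather than a true decile.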
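
If deciles with exactly equal group sizes are wanted for data like column B, one possible approach is to compute them from ordinal ranks. The sketch below is an assumption-laden alternative, not the method used in Example 3: it draws its own random data with a hypothetical seed, breaks ties by order of appearance with method='first', and maps ranks 1..n onto deciles 1..10 with integer arithmetic.

```python
import numpy as np
import pandas as pd

# Invented data for this sketch (seed chosen arbitrarily)
rng = np.random.default_rng(0)
df = pd.DataFrame({"B": rng.normal(5, 2, 1000)})

n = len(df)
# Ordinal rank 1..n; ties, if any, are broken by order of appearance
ordinal = df["B"].rank(method="first")

# Map ranks 1..n onto deciles 1..10 using floor division
df["B_decile_rank"] = ((ordinal - 1) * 10 // n + 1).astype(int)

# Each decile should contain n / 10 rows
counts = df["B_decile_rank"].value_counts().sort_index()
print(counts)
```

Because the mapping works on ranks instead of values, every decile holds 100 of the 1000 rows regardless of how the values are distributed.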

Applications

Identifying outliers in a dataset
Ranking the observations in a dataset
Comparing observations within a dataset
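
As a rough illustration of the first application, the sketch below flags observations whose quantile rank falls in the bottom or top 5% of an invented Series. The cut-offs 0.05 and 0.95 and the sample readings are hypothetical choices made for the example.

```python
import pandas as pd

# Invented readings: 23 ordinary values (50..72) plus one very low and one very high value
readings = pd.Series([50 + i for i in range(23)] + [5, 400], name="reading")

pct_rank = readings.rank(pct=True)

# Hypothetical cut-offs: bottom 5% and top 5% by rank
lower, upper = 0.05, 0.95
flagged = readings[(pct_rank <= lower) | (pct_rank > upper)]

print(flagged.sort_values().tolist())  # [5, 72, 400]
```

Note that 72 is flagged even though it is an ordinary reading: a rank-based threshold always selects a fixed share of rows, so it identifies candidates for review rather than proving that a value is anomalous.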

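Conclusion

This technical blog explored how to obtain the quantile and decile rank of a Pandas DataFrame column in Python: the rank() method with the pct argument set to True gives the quantile rank, and binning the column into 10 groups with pd.cut() or pd.qcut() gives the decile rank. Knowing the quantile and decile rank of a column can help with data analysis and visualization, because it makes the distribution of the dataset easier to understand and outliers easier to spot.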

Atharva Shah

Updated on: 21 August 2023
