Python Pandas - Dropping Missing Data



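Missing data is a common problem when working with real-world datasets. The Pandas library provides a simple way to remove rows or columns that contain missing values (NaN or NaT) from a dataset: the dropna() method.

The dropna() method is a useful tool for handling missing data because it can drop either rows or columns, depending on your needs. In this tutorial, we will learn how to use dropna() to clean a dataset by dropping missing data under various conditions.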
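Before looking at the method in detail, it can help to see which values Pandas counts as missing. The following is a minimal sketch on a made-up DataFrame (the column names and values are assumptions for illustration only): None, NaN and NaT are all treated as missing, while an empty string is not.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only to illustrate what counts as "missing".
df = pd.DataFrame({
    "name": ["Ajay", None, "Deepak", "", "Swati"],
    "joined": pd.to_datetime(
        ["2024-01-05", "2024-02-10", None, "2024-03-15", "2024-04-20"]
    ),
    "marks": [57.0, 88.0, 98.0, 74.0, np.nan],
})

# isna() marks None, NaN and NaT as True; the empty string is an ordinary value.
print(df.isna())

# dropna() removes every row flagged above, so only rows 0 and 3 remain.
cleaned = df.dropna()
print(cleaned)
```

If empty strings should also be treated as missing, they would have to be converted to NaN first, for example with replace(), before calling dropna().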

The dropna() Method

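The Pandas dropna() method removes missing values from Pandas data structures such as Series and DataFrame objects. It provides several options for controlling which rows or columns are dropped based on where NaN values occur. The method returns a new Pandas object with the missing data removed, or None if the inplace parameter is set to True.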
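The two return behaviours described above can be sketched as follows. The data is made up for illustration; the point is only that dropna() works on a Series as well as a DataFrame, and that inplace=True changes the object itself and returns None.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration.
marks = pd.Series([57, np.nan, 98, np.nan], index=[1, 2, 3, 4], name="Marks")

# A Series has a single axis, so dropna() simply removes the missing entries
# and returns a new Series; the original is left unchanged.
marks_cleaned = marks.dropna()
print(marks_cleaned)

df = pd.DataFrame({"Marks": [57, np.nan, 98, np.nan]}, index=[1, 2, 3, 4])

# With inplace=True the DataFrame is modified directly and the call returns None.
returned = df.dropna(inplace=True)
print(returned)
print(df)
```

Because the in-place call returns None, writing df = df.dropna(inplace=True) would replace the DataFrame with None; use either the assignment form or the in-place form, not both.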

Syntax

The syntax is as follows:

DataFrame.dropna(*, axis=0, how=<no_default>, thresh=<no_default>, subset=None, inplace=False, ignore_index=False)

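Where,

  • axis: 0 or 'index' (the default) drops rows; 1 or 'columns' drops columns.

  • how: Defaults to 'any', which drops a row or column if it contains any missing value. If set to 'all', a row or column is dropped only when all of its values are missing.

  • thresh: The minimum number of non-NA values a row or column must have in order to be kept. It cannot be combined with how.

  • subset: A list of the columns to consider (when dropping rows) or the rows to consider (when dropping columns).

  • inplace: If True, modifies the DataFrame in place (default False).

  • ignore_index: If True, relabels the index of the result as 0, 1, ..., n - 1 (default False).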
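Two of the parameters listed above are easier to understand with a small example. The sketch below uses made-up data and assumes a recent Pandas release (ignore_index is, to my knowledge, only available from Pandas 2.0 onward): it shows how ignore_index changes the labels of the result, and that passing both how and thresh is rejected.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {"Roll_number": [23, 45, np.nan, 18], "Marks": [57, np.nan, 98, 74]},
    index=[1, 2, 3, 4],
)

# Default behaviour: the surviving rows keep their original labels.
kept_labels = df.dropna()
print(kept_labels.index.tolist())

# ignore_index=True relabels the surviving rows starting from 0.
relabeled = df.dropna(ignore_index=True)
print(relabeled.index.tolist())

# how and thresh are alternative ways to state the rule; recent Pandas
# versions raise a TypeError when both are passed explicitly.
try:
    df.dropna(how="any", thresh=2)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```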

Let us explore how the dropna() method drops missing data under various conditions.

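Dropping Rows with Any Missing Values

By default, the dropna() method drops every row that contains at least one missing value.

Example

The following example uses the dropna() method to drop the rows that have any missing values.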

import pandas as pd
import numpy as np

dataset = {"Student_name": ["Ajay", "Krishna", "Deepak", "Swati"],
           "Roll_number": [23, 45, np.nan, 18],
           "Major_Subject": ["Maths", "Physics", "Arts", "Political science"],
           "Marks": [57, np.nan, 98, np.nan]}

df = pd.DataFrame(dataset, index=[1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop the rows that have any missing values
df_cleaned = df.dropna()
print("\nResultant DataFrame after removing rows:")
print(df_cleaned)

The output of the above code is as follows:

Original DataFrame:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing rows:
  Student_name  Roll_number Major_Subject  Marks
1         Ajay         23.0         Maths   57.0

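Dropping Rows Where All Values Are Missing

To drop only the rows in which every value is missing, pass how='all' to the dropna() method.

Example

The following example demonstrates how to drop the rows of a DataFrame in which all values are missing.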

import pandas as pd
import numpy as np

dataset = {"Student name": ["Ajay", np.nan, "Deepak", "Swati"],
           "Roll number": [23, np.nan, np.nan, 18],
           "Major Subject": ["Maths", np.nan, "Arts", "Political science"],
           "Marks": [57, np.nan, 98, np.nan]}

df = pd.DataFrame(dataset, index=[1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop rows where all values are missing
result = df.dropna(how='all')
print("\nResultant DataFrame after removing rows:")
print(result)

The output of the above code is as follows:

Original DataFrame:
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2          NaN          NaN                NaN    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing rows:
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

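Keeping Rows with a Minimum Number of Non-Missing Values

The Pandas dropna() method provides the thresh parameter, which sets the minimum number of non-missing values a row must contain in order to be kept.

Example

This example demonstrates how to keep only the rows that have at least a minimum number of non-missing values.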

import pandas as pd
import numpy as np

dataset = {"Student name": ["Ajay", "Krishna", "Deepak", "Swati"],
           "Roll number": [23, np.nan, np.nan, 18],
           "Major Subject": ["Maths", np.nan, "Arts", "Political science"],
           "Marks": [57, np.nan, 98, np.nan]}

df = pd.DataFrame(dataset, index=[1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Keep only the rows that have at least 2 non-missing values
result = df.dropna(thresh=2)
print("\nResultant DataFrame after removing rows:")
print(result)

The output of the above code is as follows:

Original DataFrame:
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna          NaN                NaN    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing rows:
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN
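To see why a given row survives a thresh filter, it can help to count the non-missing values per row yourself. The sketch below reuses the data from the example above; the 75% cut-off and the names min_fraction and required are assumptions chosen for illustration, showing one way to derive thresh from a fraction of the columns rather than hard-coding it.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Student name": ["Ajay", "Krishna", "Deepak", "Swati"],
     "Roll number": [23, np.nan, np.nan, 18],
     "Major Subject": ["Maths", np.nan, "Arts", "Political science"],
     "Marks": [57, np.nan, 98, np.nan]},
    index=[1, 2, 3, 4],
)

# Number of non-missing values in each row; thresh is compared against this count.
non_missing = df.notna().sum(axis=1)
print(non_missing)

# Hypothetical rule: keep rows that are at least 75% filled.
min_fraction = 0.75
required = math.ceil(min_fraction * df.shape[1])

result = df.dropna(thresh=required)
print(result)
```

With four columns the rule works out to thresh=3, so the row for Krishna, which has a single non-missing value, is the only one dropped.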

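Dropping Columns with Any Missing Values

To drop the columns that contain any missing values, set the axis parameter of the dropna() method so that it operates on columns.

Example

This example shows how the dropna() method drops every column that contains any missing value.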

import pandas as pd
import numpy as np

dataset = {"Student_name": ["Ajay", "Krishna", "Deepak", "Swati"],
           "Roll_number": [23, 45, np.nan, 18],
           "Major_Subject": ["Maths", "Physics", "Arts", "Political science"],
           "Marks": [57, np.nan, 98, np.nan]}

df = pd.DataFrame(dataset, index=[1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop the columns that have any missing values
result = df.dropna(axis='columns')
print("\nResultant DataFrame after removing columns:")
print(result)

The output of the above code is as follows:

Original DataFrame:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing columns:
  Student_name      Major_Subject
1         Ajay              Maths
2      Krishna            Physics
3       Deepak               Arts
4        Swati  Political science
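Dropping a column because of a single missing value is often too aggressive. As a hedged sketch, the how and thresh parameters can be applied along the column axis in the same way as for rows. The data below, including the entirely empty Remarks column, is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "Remarks" is an entirely empty column.
df = pd.DataFrame(
    {"Student_name": ["Ajay", "Krishna", "Deepak", "Swati"],
     "Roll_number": [23, 45, np.nan, 18],
     "Marks": [57, np.nan, 98, np.nan],
     "Remarks": [np.nan, np.nan, np.nan, np.nan]},
    index=[1, 2, 3, 4],
)

# Drop only the columns in which every value is missing.
only_empty_removed = df.dropna(axis="columns", how="all")
print(only_empty_removed.columns.tolist())

# Keep only the columns that have at least 3 non-missing values.
at_least_three = df.dropna(axis="columns", thresh=3)
print(at_least_three.columns.tolist())
```

Here how='all' removes only Remarks, while thresh=3 also removes Marks, which has just two non-missing values.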

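Dropping Rows Based on Missing Data in Specific Columns

You can use the subset parameter of the dropna() method to consider only specific columns when dropping rows with missing data.

Example

This example shows how to use the subset parameter of the dropna() method to drop the rows that have missing data in specific columns.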

import pandas as pd
import numpy as np

dataset = {"Student_name": ["Ajay", "Krishna", "Deepak", "Swati"],
           "Roll_number": [23, 45, np.nan, 18],
           "Major_Subject": ["Maths", "Physics", np.nan, "Political science"],
           "Marks": [57, np.nan, 98, np.nan]}

df = pd.DataFrame(dataset, index=[1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop rows based on missing data in specific columns
result = df.dropna(subset=['Roll_number', 'Major_Subject'])
print("\nResultant DataFrame after removing rows:")
print(result)

The output of the above code is as follows:

Original DataFrame:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN                NaN   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing rows:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
4        Swati         18.0  Political science    NaN
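The subset parameter can also be combined with how. The sketch below uses a slightly modified version of the data above (Krishna's roll number is made missing, purely for illustration) to contrast the default 'any' rule with 'all' when both are restricted to the same columns.

```python
import numpy as np
import pandas as pd

# Hypothetical variation of the data above: row 2 now lacks a roll number.
df = pd.DataFrame(
    {"Student_name": ["Ajay", "Krishna", "Deepak", "Swati"],
     "Roll_number": [23, np.nan, np.nan, 18],
     "Major_Subject": ["Maths", "Physics", np.nan, "Political science"],
     "Marks": [57, np.nan, 98, np.nan]},
    index=[1, 2, 3, 4],
)

cols = ["Roll_number", "Major_Subject"]

# Default how='any': drop a row if either of the listed columns is missing.
any_missing = df.dropna(subset=cols)
print(any_missing.index.tolist())

# how='all': drop a row only if both of the listed columns are missing.
both_missing = df.dropna(subset=cols, how="all")
print(both_missing.index.tolist())

# A single-column subset: drop rows that have no value in Marks.
marks_only = df.dropna(subset=["Marks"])
print(marks_only.index.tolist())
```

In every case, missing values in columns outside the subset are ignored when deciding which rows to drop.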