Python Pandas - Missing Data

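In real-world work, especially in fields such as machine learning and data analysis, missing data is a persistent problem. Missing values can seriously degrade the accuracy of models and analyses, so handling them correctly is essential. This tutorial shows how to identify and handle missing data in Python Pandas.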

When and Why Is Data Missing?

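Consider an online survey about a product. Respondents often do not share everything that is asked of them; they skip some questions, which leaves the data incomplete. For example, some people share their experience with the product but not how long they have used it, and others do the reverse. In real-world scenarios like this, missing data is common, and handling it effectively is essential.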

Representing Missing Data in Pandas

Pandas uses different sentinel values to represent missing data (NA or NaN), depending on the data type.

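numpy.nan: used for NumPy data types. Because NaN is a floating-point value, introducing a missing value into an integer array upcasts it to np.float64, and introducing one into a boolean array upcasts it to object.
NaT: used for missing dates and times in np.datetime64, np.timedelta64, and PeriodDtype. NaT stands for "Not a Time".
<NA>: used for StringDtype, Int64Dtype, Float64Dtype, BooleanDtype, and ArrowDtype. It is a more flexible missing-value representation: these types keep their original data type when a missing value is introduced.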
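The upcasting rules above can be checked directly by inspecting dtypes. The following is a minimal sketch; the variable names are illustrative, and it assumes a pandas version that supports the nullable "Int64" dtype.

```python
import pandas as pd

# Reindexing to a label that does not exist (3) introduces one missing value.
int_ser = pd.Series([1, 2, 3]).reindex([0, 1, 2, 3])
bool_ser = pd.Series([True, False, True]).reindex([0, 1, 2, 3])
nullable_ser = pd.Series([1, 2, 3], dtype="Int64").reindex([0, 1, 2, 3])

print(int_ser.dtype)       # float64: the integer data was upcast to hold NaN
print(bool_ser.dtype)      # object: NumPy bool cannot hold NaN
print(nullable_ser.dtype)  # Int64: the nullable dtype is preserved

# The missing entry in the nullable Series is the pd.NA singleton.
print(nullable_ser.loc[3] is pd.NA)  # True
```

Choosing a nullable dtype up front is one way to keep integer columns as integers when gaps may appear later.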

Example

Let us now see how Pandas represents missing data for different data types.

import pandas as pd
import numpy as np

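# Reindexing to a label that does not exist (2) introduces a missing value.
ser1 = pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])  # upcast to float64, holds NaN
ser2 = pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])  # holds NaT
ser3 = pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])  # stays Int64, holds <NA>

df = pd.DataFrame({'NumPy': ser1, 'Dates': ser2, 'Others': ser3})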
print(df)

The output is as follows:

   NumPy                         Dates  Others
0    1.0 1970-01-01 00:00:00.000000001       1
1    2.0 1970-01-01 00:00:00.000000002       2
2    NaN                           NaT    <NA>

Checking for Missing Values

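Pandas provides the isna() and notna() functions to detect missing values; they work across data types. Each returns a boolean object of the same shape as its input (a boolean Series for a Series, a boolean DataFrame for a DataFrame) that marks where values are missing.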

Example

The following example uses pd.isna() to detect missing values.

import pandas as pd
import numpy as np

ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
print(pd.isna(ser))

Running the code above produces the following output:

0    False
1     True
dtype: bool

Note that isna() and notna() also treat None as a missing value.
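As a small illustration of the point about None, the sketch below applies isna() and notna() to a DataFrame and counts the missing values per column. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: one NaN in a numeric column, one None in a text column.
df = pd.DataFrame({
    "rating": [4.0, np.nan, 5.0],
    "comment": ["good", None, "great"],
})

mask = df.isna()                        # boolean DataFrame, same shape as df
missing_per_column = df.isna().sum()    # True counts as 1, so this counts missing values
has_comment = df["comment"].notna()     # True where a comment is present

print(mask)
print(missing_per_column)
print(has_comment)
```

Summing the boolean mask is a common quick check before deciding whether to fill or drop.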

Calculations with Missing Data

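Reductions such as sum() skip NA values by default, which for a sum is equivalent to treating them as zero. The sum of an all-NA (or empty) Series is 0 unless min_count is set, while reductions such as mean() return NaN when every value is NA. Element-wise arithmetic, by contrast, propagates NA.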

Example

This example sums the values in the "one" column of a DataFrame that contains missing data.

import pandas as pd
import numpy as np

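# Random data, so the printed sum differs from run to run.
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# Reindexing adds rows 'b', 'd' and 'g', which are filled with NaN.
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])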

print(df['one'].sum())

The output is as follows (the exact value differs from run to run because the data is random):

2.02357685917
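Because the example above uses random data, its result cannot be checked by hand. The following sketch uses fixed values instead to show how sum() and mean() skip NaN by default, what the skipna and min_count parameters change, and how element-wise arithmetic propagates NaN. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
all_na = pd.Series([np.nan, np.nan])

total = s.sum()                             # 4.0: NaN is skipped
total_strict = s.sum(skipna=False)          # nan: NaN is not skipped
average = s.mean()                          # 2.0: mean of 1.0 and 3.0 only
empty_total = all_na.sum()                  # 0.0: nothing left to add
empty_total_min = all_na.sum(min_count=1)   # nan: fewer than 1 valid value
shifted = s + 1                             # element-wise: 2.0, NaN, 4.0

print(total, total_strict, average)
print(empty_total, empty_total_min)
print(shifted)
```

Passing min_count=1 is useful when an all-missing column should be reported as missing rather than as a total of zero.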

Replacing/Filling Missing Data

Pandas provides several ways to handle missing data. A common one is to replace missing values with a specific value using the fillna() method.

Example

The following program shows how to use fillna() to replace NaN with a scalar value (here, 0).

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])

# Reindexing drops row 'e' and adds row 'b', which is filled with NaN.
df = df.reindex(['a', 'b', 'c'])

print("Input DataFrame:\n",df)
print("Resultant DataFrame after NaN replaced with '0':")
print(df.fillna(0))

The output is as follows:

Input DataFrame:
        one       two     three
a  0.188006 -0.685489 -2.088354
b       NaN       NaN       NaN
c -0.446296  2.298046  0.346000
Resultant DataFrame after NaN replaced with '0':
        one       two     three
a  0.188006 -0.685489 -2.088354
b  0.000000  0.000000  0.000000
c -0.446296  2.298046  0.346000
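A constant is not the only possible fill value. The sketch below, which uses a small hand-made DataFrame rather than the one above, shows three other common options: carrying the previous value forward with ffill(), pulling the next value backward with bfill(), and filling each column with its own mean. Which strategy is appropriate depends on the data; this is an illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Illustrative data with one gap in each column.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, 5.0, 7.0]},
    index=["a", "b", "c"],
)

forward = df.ffill()             # copy the last valid value downward
backward = df.bfill()            # copy the next valid value upward
by_mean = df.fillna(df.mean())   # fill each column with its own mean

print(forward)   # 'two' at row 'a' stays NaN: there is no earlier value
print(backward)
print(by_mean)   # 'one' at row 'b' becomes 2.0, 'two' at row 'a' becomes 6.0
```

Note that a forward fill cannot repair a gap in the first row, and a backward fill cannot repair one in the last row.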

Dropping Missing Values

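If you simply want to exclude missing values rather than replace them, use the dropna() function. By default it drops every row that contains at least one missing value.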

Example

This example uses dropna() to remove the rows that contain missing values.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# Reindexing adds rows 'b', 'd' and 'g', which are filled with NaN.
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())

The output is as follows:

        one       two     three
a  0.170497 -0.118334 -1.078715
c  0.326345 -0.180102  0.700032
e  1.972619 -0.322132 -1.405863
f  1.760503 -1.179294  0.043965
h  0.747430  0.235682  0.973310
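dropna() also accepts parameters that control what gets dropped. The sketch below uses a small hand-made DataFrame (the values are illustrative) to compare the default behaviour with the how, subset, axis, and thresh parameters.

```python
import numpy as np
import pandas as pd

# Illustrative data: column 'three' is entirely missing, row 'c' is entirely missing.
df = pd.DataFrame(
    {
        "one": [1.0, np.nan, np.nan, 4.0],
        "two": [1.0, 2.0, np.nan, 4.0],
        "three": [np.nan, np.nan, np.nan, np.nan],
    },
    index=["a", "b", "c", "d"],
)

any_missing = df.dropna()                   # drop rows with any NaN: nothing survives here
all_missing = df.dropna(how="all")          # drop only rows where every value is NaN
by_subset = df.dropna(subset=["one"])       # only look at column 'one'
by_columns = df.dropna(axis=1, how="all")   # drop columns that are entirely NaN
at_least_two = df.dropna(thresh=2)          # keep rows with at least 2 non-missing values

print(len(any_missing))          # 0
print(list(all_missing.index))   # ['a', 'b', 'd']
print(list(by_subset.index))     # ['a', 'd']
print(list(by_columns.columns))  # ['one', 'two']
print(list(at_least_two.index))  # ['a', 'd']
```

The first result shows why the default can be too aggressive: a single all-missing column is enough to empty the whole DataFrame.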
