Python Pandas - Interpolation of Missing Values

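Interpolation is a technique in Pandas for handling missing values in a dataset: it estimates each missing value from the other data points around it. Both DataFrame and Series objects provide the **interpolate()** method, which fills missing values using one of several interpolation methods.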

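In this tutorial, we will learn how to use the **interpolate()** method with different interpolation methods to fill missing values in numerical data, time series data, and more.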

Basic Interpolation

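The **interpolate()** method of DataFrame and Series objects fills missing values using a chosen interpolation strategy. By default, it uses linear interpolation, which ignores the index and treats the values as equally spaced.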

Example

The following is a simple example that calls the **interpolate()** method to fill missing values.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9],
    "B": [0.25, np.nan, np.nan, 4.7, 10, 14.7, 1.3, 9.2],
})

print("Original DataFrame:")
print(df)

# Fill the missing values with the default (linear) interpolation
result = df.interpolate()
print("\nResultant DataFrame after applying the interpolation:")
print(result)

The output of the above code is as follows:

Original DataFrame:

A B
0 1.1 0.25
1 NaN NaN
2 3.5 NaN
3 NaN 4.70
4 NaN 10.00
5 NaN 14.70
6 6.2 1.30
7 7.9 9.20


Resultant DataFrame after applying the interpolation:

A B
0 1.100 0.250000
1 2.300 1.733333
2 3.500 3.216667
3 4.175 4.700000
4 4.850 10.000000
5 5.525 14.700000
6 6.200 1.300000
7 7.900 9.200000

Different Interpolation Methods

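Pandas supports many interpolation methods, including linear, polynomial, pchip, akima, and spline. These let you choose how missing values are filled based on the nature of your data. Many of the non-linear methods (for example polynomial, spline, pchip, akima, and barycentric) are backed by SciPy, so SciPy must be installed to use them.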

Example

The following example uses the **interpolate()** method with the **barycentric** interpolation technique.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9],
    "B": [0.25, np.nan, np.nan, 4.7, 10, 14.7, 1.3, 9.2],
})

print("Original DataFrame:")
print(df)

# Applying the interpolate() with Barycentric method
result = df.interpolate(method='barycentric')

print("\nResultant DataFrame after applying the interpolation:")
print(result)

The output of the above code is as follows:

Original DataFrame:

A B
0 1.1 0.25
1 NaN NaN
2 3.5 NaN
3 NaN 4.70
4 NaN 10.00
5 NaN 14.70
6 6.2 1.30
7 7.9 9.20

Resultant DataFrame after applying the interpolation:

A B
0 1.100000 0.250000
1 2.596429 57.242857
2 3.500000 24.940476
3 4.061429 4.700000
4 4.531429 10.000000
5 5.160714 14.700000
6 6.200000 1.300000
7 7.900000 9.200000
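
The choice of method matters even without SciPy. As a small sketch, the built-in `index` method can be compared with `linear` on a Series whose index is unevenly spaced. The data and the names `by_position` and `by_index` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings taken at unevenly spaced positions 0, 1 and 10.
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# "linear" ignores the index and treats the rows as equally spaced.
by_position = s.interpolate(method="linear")

# "index" uses the numeric index values as the x-coordinates.
by_index = s.interpolate(method="index")

print(by_position.tolist())
print(by_index.tolist())
```

Here `linear` places the missing value halfway between its neighbours (5.0), while `index` takes into account that position 1 lies only one tenth of the way from 0 to 10 (1.0).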

Limiting Interpolation

By default, Pandas interpolation fills all missing values, but you can use the **limit** parameter of the **interpolate()** method to cap how many consecutive NaN values are filled.

Example

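The following example fills the missing values of a Pandas DataFrame while using the **limit** parameter of the **interpolate()** method to restrict how many consecutive NaN values are filled. With limit=1, only the first NaN in each run of missing values is filled.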

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9],
    "B": [0.25, np.nan, np.nan, 4.7, 10, 14.7, 1.3, 9.2],
})

print("Original DataFrame:")
print(df)

# Applying the interpolate() with limit
result = df.interpolate(method='spline', order=2, limit=1)

print("\nResultant DataFrame after applying the interpolation:")
print(result)

The output of the above code is as follows:

Original DataFrame:

A B
0 1.1 0.25
1 NaN NaN
2 3.5 NaN
3 NaN 4.70
4 NaN 10.00
5 NaN 14.70
6 6.2 1.30
7 7.9 9.20


Resultant DataFrame after applying the interpolation:

A B
0 1.100000 0.250000
1 2.231383 -1.202052
2 3.500000 NaN
3 4.111529 4.700000
4 NaN 10.000000
5 NaN 14.700000
6 6.200000 1.300000
7 7.900000 9.200000
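
The **limit** parameter can be combined with two related parameters of `interpolate()` that this tutorial does not otherwise cover: `limit_direction` and `limit_area`. The sketch below uses the default linear method on a small hypothetical Series, so SciPy is not needed, to show how each one changes which NaN values get filled.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a three-NaN gap and one trailing NaN.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

# Fill at most one consecutive NaN, working forward from the last valid value.
forward = s.interpolate(limit=1)

# Fill at most one consecutive NaN from each side of a gap.
both = s.interpolate(limit=1, limit_direction="both")

# Only fill NaNs that are surrounded by valid values (no trailing fill).
inside = s.interpolate(limit=1, limit_area="inside")

print(pd.DataFrame({"original": s, "forward": forward,
                    "both": both, "inside": inside}))
```

In this sketch, the forward fill sets position 1 to 2.0 and carries 5.0 into the trailing NaN. With `limit_direction="both"`, position 3 is also set to 4.0. With `limit_area="inside"`, the trailing NaN is left untouched.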

Interpolating Time Series Data

Interpolation can also be applied to Pandas time series data. This is useful for filling gaps where data points are missing over time.

Example

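The following example fills the gaps in a daily time series using **method="time"**, which interpolates according to the length of the time interval between observations.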

import numpy as np
import pandas as pd

indx = pd.date_range("2024-01-01", periods=10, freq="D")
data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64)
s = pd.Series(data, index=indx)
s.iloc[[1, 2, 5, 6, 9]] = np.nan

print("Original Series:")
print(s)

result = s.interpolate(method="time")

print("\nResultant Time Series after applying the interpolation:")
print(result)

The output of the above code is as follows:

Original Series:

Date Value
2024-01-01 8.0
2024-01-02 NaN
2024-01-03 NaN
2024-01-04 2.0
2024-01-05 4.0
2024-01-06 NaN
2024-01-07 NaN
2024-01-08 0.0
2024-01-09 3.0
2024-01-10 NaN


Resultant Time Series after applying the interpolation:

Date Value
2024-01-01 8.000000
2024-01-02 6.000000
2024-01-03 4.000000
2024-01-04 2.000000
2024-01-05 4.000000
2024-01-06 2.666667
2024-01-07 1.333333
2024-01-08 0.000000
2024-01-09 3.000000
2024-01-10 3.000000
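
Because the series above is evenly spaced (one value per day), `method="time"` gives the same fill values there as the default linear method would. The following sketch, using hypothetical timestamps that are not evenly spaced, illustrates where the two differ. The names `by_position` and `by_time` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the first two are one day apart,
# the last two are nine days apart.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-11"])
s = pd.Series([0.0, np.nan, 10.0], index=idx)

# "linear" ignores the timestamps and treats the rows as equally spaced.
by_position = s.interpolate(method="linear")

# "time" weights the estimate by the elapsed time between observations.
by_time = s.interpolate(method="time")

print(by_position)
print(by_time)
```

Under these assumptions, the linear method fills the gap with the midpoint 5.0, whereas the time method fills it with a value close to 1.0, since 2024-01-02 is one day into a ten-day interval.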
