Machine Learning - Backward Elimination

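Backward elimination is a feature selection technique used in machine learning to choose the most important features for a predictive model. It starts with all features in the model and iteratively removes the least significant one until the remaining subset gives the best performance.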
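The iterative procedure described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes significance is measured by p-values, and the names `backward_elimination`, `p_values_for`, and `toy_p_values` are hypothetical. The toy function returns fixed p-values, whereas a real model would be refitted on every iteration and its p-values would change.

```python
from typing import Callable, Dict, List


def backward_elimination(
    features: List[str],
    p_values_for: Callable[[List[str]], Dict[str, float]],
    significance_level: float = 0.05,
) -> List[str]:
    """Drop the least significant feature until all remaining ones pass."""
    selected = list(features)
    while selected:
        # Re-evaluate the model on the current subset of features.
        p_values = p_values_for(selected)
        worst = max(selected, key=lambda name: p_values[name])
        if p_values[worst] <= significance_level:
            break  # every remaining feature is significant
        selected.remove(worst)  # remove exactly one feature per iteration
    return selected


# Hypothetical stand-in for a real model fit: fixed p-values per feature.
TOY_P_VALUES = {"a": 0.001, "b": 0.40, "c": 0.03, "d": 0.75}


def toy_p_values(names: List[str]) -> Dict[str, float]:
    return {name: TOY_P_VALUES[name] for name in names}


selected = backward_elimination(list(TOY_P_VALUES), toy_p_values)
print(selected)  # ['a', 'c']
```

Here "d" is removed first, then "b"; the loop stops once the worst remaining p-value (0.03 for "c") is below the 0.05 threshold.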

Python Implementation

To implement backward elimination in Python, you can follow these steps:

Import the necessary libraries: pandas, numpy, and statsmodels.api.

import pandas as pd
import numpy as np
import statsmodels.api as sm

Load your dataset into a Pandas DataFrame. We will use the Pima-Indians-Diabetes dataset.

diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')

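Define the predictor variables (X) and the target variable (y). The last column of the DataFrame is the target; all other columns are predictors.

X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values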
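To make the slicing concrete, here is a small sketch on a hypothetical three-row DataFrame (the column names are made up and are not those of the diabetes file). It also keeps the feature names in a separate list, since `.values` returns a plain NumPy array without them.

```python
import pandas as pd

# Hypothetical miniature frame; the column names are for illustration only.
demo = pd.DataFrame({
    "feature_a": [1, 2, 3],
    "feature_b": [4, 5, 6],
    "target":    [0, 1, 0],
})

X_demo = demo.iloc[:, :-1].values   # every column except the last
y_demo = demo.iloc[:, -1].values    # the last column only

# .values drops the column labels, so keep them if you need to map
# column indices back to feature names later.
feature_names = list(demo.columns[:-1])

print(X_demo.shape, y_demo.shape, feature_names)
```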

Add a column of ones to the predictor variables to represent the intercept.

X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)
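The column of ones is needed because `sm.OLS` does not add an intercept term on its own. The sketch below shows the same idea on a hypothetical 3x2 matrix (`X_demo` is not part of the tutorial's data), using `np.column_stack` as an equivalent way to prepend the column. statsmodels also provides `sm.add_constant` for this purpose.

```python
import numpy as np

# Hypothetical 3-row, 2-feature matrix standing in for the real predictors.
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# Prepend a column of ones so the first coefficient acts as the intercept.
X_with_const = np.column_stack([np.ones(len(X_demo)), X_demo])

print(X_with_const.shape)   # (3, 3)
print(X_with_const[:, 0])   # the intercept column
```

After this step, column 0 is the intercept and the original predictors are shifted to columns 1 and up, which is why the index lists in the following steps always keep 0.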

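Fit a multiple linear regression model with all the predictor variables, using ordinary least squares (OLS) from the statsmodels library. The Pima-Indians-Diabetes dataset has eight predictors, so together with the intercept X has nine columns (indices 0 to 8).

X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()

Check the p-value of each predictor variable and remove the one with the highest p-value (i.e., the least significant).

regressor_OLS.summary()

Repeat the previous two steps (refit the model, then remove the predictor with the highest p-value) until all remaining predictors have a p-value below the significance level (e.g., 0.05). The column indices dropped below are illustrative; in practice, drop whichever column the latest summary reports with the highest p-value.

X_opt = X[:, [0, 1, 2, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

X_opt = X[:, [0, 1, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

X_opt = X[:, [0, 1, 3, 5, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

X_opt = X[:, [0, 1, 3, 5, 7]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()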

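The final subset of predictor variables, all with p-values below the significance level, is the selected feature set for the model.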

Example

The following is a complete implementation of backward elimination in Python:

# Importing the necessary libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Load the diabetes dataset
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')

# Define the predictor variables (X) and the target variable (y)
X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values

# Add a column of ones to the predictor variables to represent the intercept
X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)

# Fit the multiple linear regression model with all the predictor variables
X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()

# Check the p-values of each predictor variable and remove the one
# with the highest p-value (i.e., the least significant).
# print() is needed for the summary to appear when run as a script.
print(regressor_OLS.summary())

# Repeat the above step until all the remaining predictor variables
# have a p-value below the significance level (e.g., 0.05).
# The indices dropped here are illustrative; drop whichever column
# the latest summary reports with the highest p-value.
X_opt = X[:, [0, 1, 2, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 1, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 1, 3, 5, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 1, 3, 5, 7]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

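Output

Each call to summary() produces the OLS regression results table for the fitted model, including the coefficient and p-value of every remaining predictor.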
