Machine Learning - Forward Feature Construction



Forward feature construction is a feature selection method in machine learning. We start with an empty feature set and, at each step, iteratively add the best-performing feature until the desired number of features is reached.

The goal of feature selection is to identify the features most relevant to predicting the target variable, while ignoring less important features that add noise to the model and can lead to overfitting.

Forward feature construction involves the following steps:

  • Initialize an empty feature set.

  • Set the maximum number of features to select.

  • Iterate until the desired number of features is reached:

    • For each remaining feature not yet in the selected set, fit a model using the selected features plus the current feature, and evaluate its performance on a validation set.

    • Choose the feature that yields the best performance and add it to the selected set.

  • Return the selected feature set as the best feature set for the model.
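The steps above are also available as a built-in in scikit-learn, `SequentialFeatureSelector` (since scikit-learn 0.24). A minimal sketch using scikit-learn's bundled diabetes dataset (the number of features to select here, 4, is just an illustrative choice):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# direction='forward' starts from an empty set and, at each step, greedily
# adds the feature that most improves cross-validated performance
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction='forward', cv=5
)
sfs.fit(X, y)

print('Selected feature indices:', sfs.get_support(indices=True))
```

Unlike the manual loop shown later in this article, `SequentialFeatureSelector` scores each candidate with cross-validation rather than a single held-out set.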

The main advantage of forward feature construction is that it is relatively computationally efficient — it evaluates far fewer candidate subsets than an exhaustive search — so it can be used on high-dimensional datasets. However, being greedy, it does not always produce the best feature set, particularly when features are highly correlated with each other or have non-linear relationships with the target variable.
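To make the non-linearity caveat concrete, here is a small synthetic sketch (not from the original article): the target depends only on the *interaction* of two features, so a linear model scores each feature — and even both together — near zero, leaving greedy forward selection with no signal to guide its picks.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
y = x1 * x2  # target depends on an interaction, not on either feature alone

# R^2 of a linear fit for each feature alone and for both together;
# all three stay close to zero despite y being fully determined by x1 and x2
candidates = [x1.reshape(-1, 1), x2.reshape(-1, 1), np.column_stack([x1, x2])]
scores = [LinearRegression().fit(X, y).score(X, y) for X in candidates]
print([round(s, 3) for s in scores])
```

A tree-based model, or adding an explicit `x1 * x2` interaction term, would recover the relationship that the linear scorer misses.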

Example

Here is an example of how to implement forward feature construction in Python:

# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the diabetes dataset
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')

# Define the predictor variables (X) and the target variable (y)
X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Create an empty set of features
selected_features = set()

# Set the maximum number of features to be selected
max_features = 8

# Iterate until the desired number of features is reached
while len(selected_features) < max_features:

   # Track the best feature and score found in this pass
   best_feature = None
   best_score = -np.inf  # R^2 can be negative, so start below any possible score
   
   # Iterate over all the remaining features
   for i in range(X_train.shape[1]):

      # Skip the feature if it's already selected
      if i in selected_features:
         continue
      
      # Select the current feature and fit a linear regression model
      X_train_selected = X_train[:, list(selected_features) + [i]]
      regressor = LinearRegression()
      regressor.fit(X_train_selected, y_train)
      
      # Compute the R^2 score on the testing set (used here as a simple validation set)
      X_test_selected = X_test[:, list(selected_features) + [i]]
      score = regressor.score(X_test_selected, y_test)

      # Update the best feature and score if the current feature performs better
      if score > best_score:
         best_feature = i
         best_score = score

   # Add the best feature to the set of selected features
   selected_features.add(best_feature)
   
   # Print the selected features and the score
   print('Selected Features:', list(selected_features))
   print('Score:', best_score)

Output

When executed, it will produce the following output:

Selected Features: [1]
Score: 0.23530716168783583
Selected Features: [0, 1]
Score: 0.2923143573608237
Selected Features: [0, 1, 5]
Score: 0.3164103491569179
Selected Features: [0, 1, 5, 6]
Score: 0.3287368302427327
Selected Features: [0, 1, 2, 5, 6]
Score: 0.334586804842275
Selected Features: [0, 1, 2, 3, 5, 6]
Score: 0.3356264736550455
Selected Features: [0, 1, 2, 3, 4, 5, 6]
Score: 0.3313166516703744
Selected Features: [0, 1, 2, 3, 4, 5, 6, 7]
Score: 0.32230203252064216