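Developing Machine Learning Models with Python and scikit-learn

Machine learning is a branch of artificial intelligence that lets machines learn from data and improve without being explicitly programmed. Scikit-learn is a popular Python machine learning library that provides a range of tools for predictive modeling, data mining, and data analysis.

In this tutorial, we will explore how to develop machine learning models with the scikit-learn library. We start with a brief introduction to machine learning and scikit-learn, then move on to the main content: data preprocessing, model selection, model training, model evaluation, and model deployment. We will use example datasets to demonstrate each step of the machine learning process.

By the end of this tutorial, you will have a solid understanding of how to develop machine learning models with Python and scikit-learn.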

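Getting Started

Scikit-learn is not part of the Python standard library, so we must install it before we can use it. This can be done with the pip package manager.

To install scikit-learn, open your terminal and type the following command: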

pip install scikit-learn

This downloads and installs scikit-learn along with its dependencies. Once the installation finishes, we can start using scikit-learn and its modules!
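A quick way to confirm that the installation worked is to import the package and print its version. This is a minimal check rather than part of the original walkthrough; note that the package is installed under the name scikit-learn but imported as sklearn.

```python
# Verify that scikit-learn can be imported and report the installed version.
import sklearn

installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```

If the import fails with ModuleNotFoundError, pip most likely installed the package into a different Python environment than the one running the script.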

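Step 1: Data Preprocessing

The first step in building a machine learning model is preparing the data. Scikit-learn provides a variety of preprocessing tools, for example for handling missing values, encoding categorical variables, and scaling data. Let's look at some examples.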

# Import the necessary libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the dataset
dataset = pd.read_csv('data.csv')

# Handle missing values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(dataset.iloc[:, 1:3])
dataset.iloc[:, 1:3] = imputer.transform(dataset.iloc[:, 1:3])

# Encode categorical variables
labelencoder = LabelEncoder()
dataset.iloc[:, 0] = labelencoder.fit_transform(dataset.iloc[:, 0])

# Scale the data
scaler = StandardScaler()
dataset.iloc[:, 1:3] = scaler.fit_transform(dataset.iloc[:, 1:3])

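In this code, we first load the dataset with pandas. We then handle missing values by replacing them with the mean of their column. Next, we encode the categorical variable, and finally we scale the data. The code assumes that column 0 of data.csv holds the categorical variable and that columns 1 and 2 hold numeric features.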
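Because data.csv is not supplied with the tutorial, the same three steps can be tried on a small in-memory table. The sketch below is only an illustration: the column names and values are made up, and the table merely stands in for whatever the real file contains.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data standing in for data.csv (names and values are invented).
toy = pd.DataFrame({
    "label": ["yes", "no", "yes", "no"],
    "age": [20.0, 30.0, np.nan, 40.0],
    "income": [1000.0, np.nan, 3000.0, 2000.0],
})
numeric_cols = ["age", "income"]

# Replace each missing value with the mean of its column.
imputed = SimpleImputer(strategy="mean").fit_transform(toy[numeric_cols])

# Map the string labels to integers; classes are ordered alphabetically.
encoder = LabelEncoder()
labels = encoder.fit_transform(toy["label"])

# Standardize each numeric column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(imputed)

print(imputed)
print(labels, list(encoder.classes_))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

In a real project the imputer and scaler are usually fitted on the training split only and then applied to the test split, so that statistics from the test data do not leak into training.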

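Step 2: Model Selection

Once preprocessing is done, the next step is to choose a suitable model for our problem. Scikit-learn provides models for different types of problems, such as classification, regression, and clustering. Let's look at an example of choosing a classification model.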

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(dataset.iloc[:, 1:3], dataset.iloc[:, 0], test_size=0.2, random_state=0)

# Train the K-NN model
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

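In this code, we first split the dataset into training and test sets with the train_test_split function. We then train a K-NN (k-nearest neighbors) classification model using the KNeighborsClassifier class. Finally, we predict the test set results with the predict method.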
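The example above fits a single model. One common way to actually choose between candidates is to compare their cross-validated scores. The following sketch does that with cross_val_score; it uses the built-in iris dataset instead of data.csv, and the two candidates and the 5-fold setting are illustrative choices, not a recommendation from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Built-in dataset used here only so the example is self-contained.
X, y = load_iris(return_X_y=True)

# Candidate models to compare (an illustrative shortlist).
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(random_state=0),
}

# Score every candidate with 5-fold cross-validation and keep the mean accuracy.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.3f}")

# Pick the candidate with the highest mean accuracy.
best_name = max(mean_scores, key=mean_scores.get)
print("Best candidate:", best_name)
```

Cross-validation averages over several train/test splits, so the comparison depends less on one lucky or unlucky split than a single call to train_test_split does.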

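Step 3: Model Training

With the data prepared, we can train our machine learning model. Scikit-learn offers a variety of machine learning models, such as decision trees, random forests, support vector machines, and more.

In this example, we will train a decision tree classifier on the iris dataset. The code is as follows: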

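from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# load the iris dataset: X holds the four measurements, y the species labels
iris = load_iris()
X, y = iris.data, iris.target

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)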

# create the model
clf = DecisionTreeClassifier()

# train the model
clf.fit(X_train, y_train)

# test the model
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)

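We use the train_test_split function to split the data into training and test sets. This function randomly divides the data into two parts, one for training and one for testing. The test_size parameter specifies the fraction of the data reserved for testing; here 0.3 means 30%.

Next, we create an instance of the DecisionTreeClassifier class and train it on the training data. Finally, we test the model on the test data and compute its accuracy.

The output of this code is the model's accuracy on the test data. The accuracy will vary depending on the random state used to split the data.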
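To make the effect of the random state concrete, the sketch below repeats the split with several seeds and prints the resulting split sizes and accuracies. The seed values and the fixed random_state on the classifier are choices made for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

accuracies = []
for seed in range(5):
    # test_size=0.3 reserves 45 of the 150 samples for testing.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # Fix the tree's own seed so that only the split changes between runs.
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))
    print(f"seed={seed}: train={len(X_tr)}, test={len(X_te)}, "
          f"accuracy={accuracies[-1]:.3f}")
```

Passing a fixed random_state, as the tutorial code does, makes the split and therefore the reported accuracy reproducible from run to run.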

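Step 4: Model Evaluation

After training a model, we need to evaluate its performance. Scikit-learn provides several metrics for evaluating machine learning models, including accuracy, precision, recall, F1 score, and more.

In this example, we will use a confusion matrix and a classification report to evaluate our decision tree classifier. The code is as follows: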

from sklearn.metrics import confusion_matrix, classification_report

# make predictions on the test data
y_pred = clf.predict(X_test)

# print the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

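First, we use the predict method of the DecisionTreeClassifier instance to make predictions on the test data. We then print the confusion matrix and the classification report using the confusion_matrix and classification_report functions from the sklearn.metrics module.

The confusion matrix counts correct and incorrect predictions per class. For a binary problem these counts are the true positives, false positives, true negatives, and false negatives; for the three-class iris data it is a 3x3 table in which row i, column j holds the number of samples of true class i that were predicted as class j. The classification report shows the precision, recall, F1 score, and support (the number of true samples) for each class.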
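The relationship between the confusion matrix and the reported metrics is easiest to see on a tiny binary example. The labels below are invented for illustration; the sketch derives precision, recall, and F1 by hand from the four counts, and the results can be compared against scikit-learn's own metric functions.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up binary labels: 1 is the positive class, 0 the negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels 0/1 the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Derive the metrics by hand from the four counts.
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision, recall, F1:", precision, recall, f1)
```

Here three of the four actual positives are found and three of the four predicted positives are correct, so precision, recall, and F1 all come out to 0.75.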

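Step 5: Model Deployment

After training and evaluating a model, we can deploy it to make predictions on new data. Here is an example of using the trained decision tree classifier to predict the species of a new iris flower: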

# create a new iris flower
new_flower = [[5.1, 3.5, 1.4, 0.2]]

# make a prediction
prediction = clf.predict(new_flower)

# print the prediction
print("Prediction:", iris.target_names[prediction[0]])

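We create a new iris flower described by the same four measurements as the flowers in the dataset (sepal length, sepal width, petal length, and petal width, in centimeters). We then use the predict method of the trained DecisionTreeClassifier instance to make a prediction for the new data. Finally, we print the predicted species.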

Output

It will produce the following output:

Prediction: setosa
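Deployment usually also means reusing a trained model in another process without retraining it. One possible approach, sketched here under assumptions that go beyond the original text, is to save the fitted classifier with joblib (installed as a scikit-learn dependency) and load it again before predicting. The file name iris_tree.joblib and the temporary directory are hypothetical choices for this example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Train a small model to persist (same dataset as in the steps above).
iris = load_iris()
model = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Save the fitted model to disk and load it back.
# The file name is a hypothetical choice for this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_tree.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model predicts exactly like the original one.
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted_species = iris.target_names[restored.predict(new_flower)[0]]
print("Prediction:", predicted_species)
```

Files written by joblib are pickle-based, so they should only be loaded from sources you trust, and ideally with the same scikit-learn version that created them.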

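Conclusion

In this tutorial, we learned how to develop machine learning models with Python and scikit-learn. We covered the basics of data preparation, model training, model evaluation, and model deployment.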

S Vijay Balaji

Updated on: August 31, 2023
