Scikit Learn - KNeighbors Classifier



The K in the name of this classifier represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this classifier implements learning based on the k nearest neighbors. The choice of the value of k is dependent on the data. Let's understand it more with the help of an implementation example; a bare-bones sketch of the underlying idea follows below.
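
Before turning to scikit-learn, here is a minimal sketch of that idea in plain NumPy, purely for intuition. It is not how scikit-learn implements the classifier internally, and X_train/y_train are assumed to be NumPy arrays:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
   # Euclidean distance from the query point x to every training point
   distances = np.linalg.norm(X_train - x, axis=1)
   # labels of the k closest training points, decided by majority vote
   nearest_labels = y_train[np.argsort(distances)[:k]]
   return Counter(nearest_labels).most_common(1)[0][0]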

Implementation Example

In this example, we will implement KNN on the dataset named Iris Flower data set by using scikit-learn KNeighborsClassifier.

  • This dataset has 50 samples for each of the different iris flower species (setosa, versicolor, virginica), i.e. a total of 150 samples.

  • For each sample, we have 4 features named sepal length, sepal width, petal length and petal width. (A quick way to verify the per-class counts is sketched right after this list.)
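
Once the dataset has been loaded (as done in the next step), the per-class sample counts can be verified with a quick check, for example:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
print(np.bincount(iris.target))   # prints [50 50 50], i.e. 50 samples per species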

First, import the dataset and print the feature names as follows:

from sklearn.datasets import load_iris
iris = load_iris()
print(iris.feature_names)

Output

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Example

Now we can print the target, i.e. the integers representing the different species. Here 0 = setosa, 1 = versicolor and 2 = virginica.

print(iris.target)

Output

[
   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
   2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
   2 2
]

Example

The following line of code will show the names of the target:

print(iris.target_names)

Output

['setosa' 'versicolor' 'virginica']

Example

We can check the number of observations and features with the help of the following line of code (the iris dataset has 150 observations and 4 features):

print(iris.data.shape)

Output

(150, 4)

Now, we need to split the data into training and testing data. We will be using Sklearn's train_test_split function to split the data into the ratio of 70 (training data) and 30 (testing data):

X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)
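
Note that train_test_split shuffles the data randomly, so the exact split (and hence the accuracies reported below) will vary from run to run. If reproducible results are desired, optional arguments can pin the split down; a sketch:

# random_state fixes the shuffle; stratify=y keeps the 50/50/50
# class balance in both the training and the testing split
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size = 0.30, random_state = 42, stratify = y
)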

Next, we will be doing data scaling with the help of the Sklearn preprocessing module as follows:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
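
As an aside, the scaler and the classifier can also be chained into a single estimator with scikit-learn's Pipeline, which applies the same fitted scaling to any future input automatically. A minimal sketch (such a pipeline would be fit on the unscaled split):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# fit() scales X_train internally; predict() scales its input the same way
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors = 8))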

Example

The following lines of code will give the shape of the train and test objects:

print(X_train.shape)
print(X_test.shape)

Output

(105, 4)
(45, 4)

Example

The following lines of code will give the shape of the new y objects:

print(y_train.shape)
print(y_test.shape)

Output

(105,)
(45,)

Next, import the KNeighborsClassifier class from Sklearn as follows:

from sklearn.neighbors import KNeighborsClassifier
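
The constructor of this class takes several parameters worth knowing about; the values below merely illustrate the options and are not a recommendation:

# n_neighbors - the k in KNN
# weights     - 'uniform' (plain majority vote) or 'distance' (closer points count more)
# metric, p   - the distance measure; 'minkowski' with p = 2 is Euclidean distance
knn = KNeighborsClassifier(n_neighbors = 5, weights = 'uniform', metric = 'minkowski', p = 2)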

To check the accuracy, we need to import the metrics module as follows:

from sklearn import metrics

We are going to run the classifier for k = 1 to 14, recording the testing accuracy for each value, plotting it, and showing the confusion matrix and classification report for the last run:

k_range = range(1, 15)
scores = {}
scores_list = []
for k in k_range:
   classifier = KNeighborsClassifier(n_neighbors=k)
   classifier.fit(X_train, y_train)
   y_pred = classifier.predict(X_test)
   # record the testing accuracy obtained with this value of k
   scores[k] = metrics.accuracy_score(y_test, y_pred)
   scores_list.append(scores[k])
# confusion matrix and report for the last fitted model (k = 14)
result = metrics.confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = metrics.classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
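
A single train/test split can make the accuracy-versus-k comparison noisy. As an optional, more robust alternative, each candidate k can be scored with cross-validation on the training set, for example:

from sklearn.model_selection import cross_val_score

# mean 5-fold cross-validated accuracy for each candidate value of k
for k in k_range:
   cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors = k),
      X_train, y_train, cv = 5)
   print(k, cv_scores.mean())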

Example

Now, we will plot the relationship between the values of K and the corresponding testing accuracy. It will be done using the matplotlib library.

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(k_range,scores_list)
plt.xlabel("Value of K")
plt.ylabel("Accuracy")

Output

Confusion Matrix:
[
   [15 0 0]
   [ 0 15 0]
   [ 0 1 14]
]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.94      1.00      0.97        15
           2       1.00      0.93      0.97        15

   micro avg       0.98      0.98      0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

Text(0, 0.5, 'Accuracy')

[Plot: testing accuracy plotted against the value of K]
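
Incidentally, the best value of k can also be read directly off the scores dictionary that was filled in the loop above (ties resolved arbitrarily):

best_k = max(scores, key = scores.get)
print("Best k:", best_k, "accuracy:", scores[best_k])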

Example

For the above model, we can choose the optimal value of K (any value between 6 and 14, since the accuracy is highest in this range), say 8, and retrain the model as follows:

classifier = KNeighborsClassifier(n_neighbors = 8)
classifier.fit(X_train, y_train)

Output

KNeighborsClassifier(
   algorithm = 'auto', leaf_size = 30, metric = 'minkowski',
   metric_params = None, n_jobs = None, n_neighbors = 8, p = 2,
   weights = 'uniform'
)

Example

Now, we can use the retrained model to classify new observations whose species is unknown:

classes = {0:'setosa', 1:'versicolor', 2:'virginica'}
x_new = [[1, 1, 1, 1], [4, 3, 1.3, 0.2]]
y_predict = classifier.predict(x_new)
print(classes[y_predict[0]])
print(classes[y_predict[1]])

Output

virginica
virginica
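
One caveat about the prediction above: X_train and X_test were standardized earlier, so strictly speaking new samples should be passed through the same fitted scaler before prediction; with raw inputs the distances are computed on a different scale. A sketch of the corrected call (the predicted species may then differ from the output shown above):

# reuse the scaler that was fitted on X_train
x_new_scaled = scaler.transform(x_new)
y_predict = classifier.predict(x_new_scaled)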

Complete executable program

from sklearn.datasets import load_iris
iris = load_iris()
print(iris.target_names)
print(iris.data.shape)
X = iris.data[:, :4]
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape)
print(X_test.shape)

from sklearn.neighbors import KNeighborsClassifier

from sklearn import metrics

k_range = range(1, 15)
scores = {}
scores_list = []
for k in k_range:
   classifier = KNeighborsClassifier(n_neighbors=k)
   classifier.fit(X_train, y_train)
   y_pred = classifier.predict(X_test)
   # record the testing accuracy obtained with this value of k
   scores[k] = metrics.accuracy_score(y_test, y_pred)
   scores_list.append(scores[k])
# confusion matrix and report for the last fitted model (k = 14)
result = metrics.confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = metrics.classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
import matplotlib.pyplot as plt
plt.plot(k_range, scores_list)
plt.xlabel("Value of K")
plt.ylabel("Accuracy")
plt.show()

classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)

classes = {0:'setosa', 1:'versicolor', 2:'virginica'}
x_new = [[1, 1, 1, 1], [4, 3, 1.3, 0.2]]
y_predict = classifier.predict(x_new)
print(classes[y_predict[0]])
print(classes[y_predict[1]])