Machine Learning - Decision Tree Algorithm

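A decision tree is a hierarchical, tree-structured algorithm that classifies or predicts an outcome by applying a set of rules. It works by splitting the data into subsets according to the values of the input features. The algorithm splits the data recursively until the samples in each subset all belong to the same class or share the same target value. The resulting tree is a set of decision rules that can be used to classify new data or to make predictions for it.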

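At each node, the algorithm chooses the best feature on which to split the data. The best feature is the one that yields the largest information gain, that is, the largest reduction in entropy. Entropy measures the randomness or disorder in the data, and information gain measures how much a split on a particular feature reduces it. The algorithm uses these measures to decide which feature to split on at each node.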
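
To make these two measures concrete, here is a minimal sketch that computes Shannon entropy and information gain for a toy set of labels. The helper names `entropy` and `information_gain` and the toy labels are assumptions made for illustration; they are not part of any library.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy labels (hypothetical): 5 "yes" and 5 "no" samples at the parent node
parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely
perfect_split = [["yes"] * 5, ["no"] * 5]

# A split that leaves both children as mixed as the parent
useless_split = [["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
print("Gain of perfect split:", gain_perfect)
print("Gain of useless split:", gain_useless)
```

The split that yields pure children removes all of the parent's entropy, while the split that keeps the same class mix in every child gains nothing, so the algorithm would prefer the first.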

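Below is an example of a binary decision tree that predicts whether a person is fit, given information such as age, eating habits, and exercise habits: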

In the decision tree shown above, the questions are the decision nodes and the final outcomes are the leaf nodes.
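
A tree like this can also be read as nested if/else rules. The sketch below is only an illustration: the function name `predict_fit`, the questions, and the age threshold of 30 are hypothetical stand-ins, not values taken from the figure.

```python
def predict_fit(age, eats_junk_food, exercises_regularly):
    """Walk a small hand-written decision tree and return a leaf label."""
    # Decision node: age (the threshold of 30 is a hypothetical value)
    if age < 30:
        # Decision node: eating habits
        if eats_junk_food:
            return "unfit"  # leaf node
        return "fit"  # leaf node
    # Decision node: exercise habits
    if exercises_regularly:
        return "fit"  # leaf node
    return "unfit"  # leaf node


print(predict_fit(age=25, eats_junk_food=False, exercises_regularly=False))
print(predict_fit(age=45, eats_junk_food=True, exercises_regularly=False))
```

Each `if` corresponds to a decision node and each `return` to a leaf node. A learned tree has the same shape, except that the questions and thresholds are chosen from the training data.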

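Types of Decision Tree Algorithms

There are two main types of decision tree algorithms:

Classification Tree - A classification tree assigns data to discrete classes or categories. It works by splitting the data into subsets according to the values of the input features and assigning each subset to a class.
Regression Tree - A regression tree predicts a numeric or continuous variable. It works by splitting the data into subsets according to the values of the input features and assigning each subset a numeric value.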
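
As a rough illustration of a regression tree, the sketch below fits scikit-learn's `DecisionTreeRegressor` to a made-up one-dimensional dataset. The data, the variable names, and the choice of `max_depth=1` are assumptions for the example only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical): y jumps from about 1.0 to about 5.0 at x = 5
X = np.arange(10).reshape(-1, 1)
y = np.array([1.0, 1.1, 0.9, 1.0, 1.0, 5.0, 5.1, 4.9, 5.0, 5.0])

# A depth-1 tree makes a single split, so it has exactly two leaves
reg = DecisionTreeRegressor(max_depth=1, random_state=0)
reg.fit(X, y)

# Each leaf predicts the mean target value of the training samples it holds
pred = reg.predict([[2], [7]])
print(pred)
```

The prediction for any input is the mean target of the leaf it falls into, which is why a regression tree produces a piecewise-constant function.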

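Implementation in Python

Let's implement the decision tree algorithm in Python for a classification task, using a popular dataset known as the Iris dataset. It contains 150 iris flower samples, each with four features: sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.

First, we import the necessary libraries and load the dataset: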

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

Next, we create an instance of the decision tree classifier and train it on the training set:

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)
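
The classifier above relies on scikit-learn's defaults, which use Gini impurity as the split criterion and place no limit on tree depth. As a sketch of how the entropy-based criterion described earlier could be selected, the example below passes `criterion="entropy"`. The `max_depth=3` and `random_state=0` values and the name `entropy_tree` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Split on information gain (entropy) and cap the depth to keep the tree small
entropy_tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, random_state=0)
entropy_tree.fit(X_train, y_train)

depth = entropy_tree.get_depth()
test_accuracy = entropy_tree.score(X_test, y_test)
print("Depth:", depth)
print("Test accuracy:", test_accuracy)
```

Limiting the depth is one common way to reduce overfitting, at the possible cost of a less exact fit to the training data.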

Now we can use the trained classifier to make predictions on the test set:

# Make predictions on the testing data
y_pred = dtc.predict(X_test)
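
The same `predict` method also works on a single new sample. The sketch below trains a tree on the full dataset and classifies one flower. The measurements in `new_flower` are illustrative values typical of a setosa flower, and the names `clf`, `species`, and `probabilities` are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# One flower: sepal length, sepal width, petal length, petal width (in cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]

# predict returns the class index; target_names maps it to the species name
label = clf.predict(new_flower)[0]
species = iris.target_names[label]

# predict_proba returns the class fractions in the leaf the sample falls into
probabilities = clf.predict_proba(new_flower)[0]
print(species, probabilities)
```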

We can evaluate the performance of the classifier by calculating its accuracy:

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)
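
The manual calculation above is equivalent to two helpers that scikit-learn already provides: `accuracy_score` from `sklearn.metrics` and the classifier's own `score` method. The sketch below repeats the setup so that it runs on its own and checks that all three agree. Fixing `random_state=0` on the classifier is an added assumption to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)

# Three equivalent ways to compute the fraction of correct predictions
manual = (y_pred == y_test).mean()
from_metrics = accuracy_score(y_test, y_pred)
from_score = dtc.score(X_test, y_test)
print(manual, from_metrics, from_score)
```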

We can visualize the decision tree using the Matplotlib library:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Visualize the Decision Tree using Matplotlib
plt.figure(figsize=(20,10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names,
          class_names=iris.target_names)
plt.show()

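The decision tree is drawn with the `plot_tree` function from the `sklearn.tree` module. We pass in the trained decision tree classifier; the `filled` parameter fills the nodes with color, the `feature_names` parameter labels the features, and the `class_names` parameter labels the target classes. We also pass `figsize` to `plt.figure` to set the size of the figure and call the `show` function to display the plot.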
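
Where a graphical display is not available, the learned rules can also be printed as plain text with `export_text` from the same module. The sketch below is a minimal illustration; the `max_depth=2` setting and the name `clf` are assumptions chosen to keep the printed output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set short
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text returns the tree as an indented string of if/else style rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is either a threshold test on a feature or a leaf with its predicted class, which makes the "set of decision rules" view of the tree explicit.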

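Complete Implementation Example

The following is a complete example that implements the decision tree classification algorithm in Python using the Iris dataset: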

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

# Create a Decision Tree classifier
dtc = DecisionTreeClassifier()

# Fit the classifier to the training data
dtc.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = dtc.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

# Visualize the Decision Tree using Matplotlib
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(20,10))
plot_tree(dtc, filled=True, feature_names=iris.feature_names,
          class_names=iris.target_names)
plt.show()

Output

Running this program prints the accuracy on the test set and displays a plot of the decision tree:

Accuracy: 0.9777777777777777

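As you can see, the plot shows the structure of the decision tree: each internal node represents a decision based on a feature value, and each leaf node represents a class (or, in a regression tree, a numeric value). The color of each node indicates the majority class of the samples in that node, and the `samples` entry in each node gives the number of training samples that reach it.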
