How to Split a Dataset into Training and Test Sets in Python


In this tutorial, we will learn how to split a dataset into a training set and a test set using Python.

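Introduction

When building machine learning and deep learning models, we often have a single dataset that must serve for both training and evaluation. In that situation we split the dataset into separate subsets and use each subset for one task or stage, such as training. This is where training and test sets come in.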

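Why training and test sets are needed

Splitting the data is one of the most important, and also one of the simplest, preprocessing techniques. Two common problems with machine learning models are overfitting and underfitting. Overfitting means the model performs very well on the training data but fails to generalize to unseen samples. This can happen when the model learns the noise in the data.

Underfitting is the opposite problem: the model performs poorly even on the training data, so it cannot generalize well either. This can happen when the training data is insufficient or the model is too simple for the task.

One of the simplest ways to detect and guard against these problems is to split the dataset into a training set and a test set. The training set is used to train the model, that is, to learn its parameters. The test set is used to evaluate how the model performs on data it has not seen.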
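To make the overfitting idea concrete, here is a minimal sketch that compares accuracy on the training set with accuracy on the held-out test set. The synthetic data, the 20% label noise, and the choice of `DecisionTreeClassifier` are assumptions made for illustration only; they are not part of the original tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): the label depends on the
# first feature only, and roughly 20% of the labels are flipped as noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(200) < 0.2
y = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A tree with no depth limit can memorize the training set, noise included.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

A large gap between the two numbers suggests overfitting, and it only becomes visible because part of the data was held out.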

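Some terminology

Training set

The part of the dataset used to train the model. It is typically about 70% of the whole dataset, but you can try other percentages, such as 60% or 80%, depending on the use case. This part of the dataset is used to learn and fit the model's parameters.

Test set

The part of the dataset used to evaluate the model. It is typically about 30% of the whole dataset, but you can try other percentages, such as 40% or 20%, depending on the use case.

In practice, we split the dataset into training and test sets in a ratio such as 70:30 or 80:20, depending on our needs.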
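The ratios translate directly into sample counts. The helper below is a small sketch of that arithmetic; `split_sizes` is a hypothetical name introduced here, and rounding the training count down is an assumed convention.

```python
def split_sizes(n_samples, train_percent):
    """Return (n_train, n_test) for a given training percentage.

    Integer arithmetic avoids floating-point rounding surprises.
    The training count is rounded down; the remainder goes to the test set.
    """
    n_train = n_samples * train_percent // 100
    return n_train, n_samples - n_train


sizes_70 = split_sizes(1000, 70)
sizes_80 = split_sizes(1000, 80)
print("70:30 split of 1000 samples:", sizes_70)
print("80:20 split of 1000 samples:", sizes_80)
```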

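Splitting a dataset into training and test sets in Python

There are three basic ways to split a dataset:

Using sklearn's train_test_split
Using numpy indexing
Using pandas sample

Let's briefly look at each of these methods.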

1. Using sklearn's train_test_split

Example

import numpy as np
from sklearn.model_selection import train_test_split

# 10 samples with 5 features each, and one binary label per sample
x = np.arange(0, 50).reshape(10, 5)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])

# Hold out 30% of the samples for testing; random_state makes the shuffle reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=4)

print("Shape of x_train is", x_train.shape)
print("Shape of x_test is", x_test.shape)
print("Shape of y_train is", y_train.shape)
print("Shape of y_test is", y_test.shape)

Output

Shape of x_train is (7, 5)
Shape of x_test is (3, 5)
Shape of y_train is (7,)
Shape of y_test is (3,)
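A purely random split can leave the two sets with different class proportions, which matters when the classes are imbalanced. As a hedged sketch, `train_test_split` accepts a `stratify` argument that keeps the label proportions roughly equal in both sets. The imbalanced 80/20 labels below are an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced labels (an assumption): 80 samples of class 0, 20 of class 1
x = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks for the same class proportions in the training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=4, stratify=y)

print("Class counts in y_train:", np.bincount(y_train))
print("Class counts in y_test:", np.bincount(y_test))
```

Here both sets keep the 4:1 ratio of the full dataset, so the test score is not distorted by one class being over-represented.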

2. Using numpy indexing

Example

import numpy as np

# 100 samples with 5 features each, and one target value per sample
x = np.random.rand(100, 5)
y = np.random.rand(100, 1)

# The first 80 rows form the training set, the last 20 rows the test set
x_train, x_test = x[:80, :], x[80:, :]
y_train, y_test = y[:80, :], y[80:, :]

print("Shape of x_train is", x_train.shape)
print("Shape of x_test is", x_test.shape)
print("Shape of y_train is", y_train.shape)
print("Shape of y_test is", y_test.shape)

Output

Shape of x_train is (80, 5)
Shape of x_test is (20, 5)
Shape of y_train is (80, 1)
Shape of y_test is (20, 1)
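Plain slicing keeps the original row order, so if the data is sorted (for example by label or by date) the test set may not be representative. One possible variation, sketched here under the assumption that the rows can be shuffled freely, is to permute the row indices first and then slice. The seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, used only for reproducibility
x = rng.random((100, 5))
y = rng.random((100, 1))

# Shuffle the row indices once and reuse them for x and y so the pairs stay aligned
indices = rng.permutation(len(x))
split = round(0.8 * len(x))
train_idx, test_idx = indices[:split], indices[split:]

x_train, x_test = x[train_idx], x[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print("Shape of x_train is", x_train.shape)
print("Shape of x_test is", x_test.shape)
print("Shape of y_train is", y_train.shape)
print("Shape of y_test is", y_test.shape)
```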

3. Using pandas sample

Example

import pandas as pd
import numpy as np

# A small DataFrame with 5 rows and 3 columns of random integers in [10, 25)
data = np.random.randint(10, 25, size=(5, 3))
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])

# Randomly sample 80% of the rows for training; the remaining rows form the test set
train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]

print("Dataset shape : {}".format(df.shape))
print("Train dataset shape : {}".format(train_df.shape))
print("Test dataset shape : {}".format(test_df.shape))

Output

Dataset shape : (5, 3)
Train dataset shape : (4, 3)
Test dataset shape : (1, 3)
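An equivalent way to obtain the remaining rows, shown here as a sketch, is `DataFrame.drop` with the sampled index. Like the `isin` approach, it assumes the index labels are unique; the 10-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducible illustrative data
df = pd.DataFrame(rng.integers(10, 25, size=(10, 3)),
                  columns=["col1", "col2", "col3"])

# Sample 80% of the rows for training, then drop those rows to get the test set
train_df = df.sample(frac=0.8, random_state=100)
test_df = df.drop(train_df.index)

print("Train dataset shape : {}".format(train_df.shape))
print("Test dataset shape : {}".format(test_df.shape))
```

If the index contains duplicates, calling `df.reset_index(drop=True)` before sampling is one way to make the labels unique.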

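Conclusion

The train-test split is a very important preprocessing step in machine learning tasks in Python. Evaluating on held-out data helps reveal overfitting and underfitting before a model is put to use.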

Mithilesh Pradhan

Updated on: December 1, 2022
