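Exploratory Data Analysis on the Iris Dataset

Introduction

In machine learning and data science, exploratory data analysis (EDA) is the process of examining a dataset and summarizing its main characteristics. It often uses visualization to represent those characteristics or to give an overall picture of the data. It is an important step in the data science life cycle and usually takes a fair amount of time.

In this article, we will use exploratory data analysis to look at some characteristics of the Iris dataset.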

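Iris Dataset

The Iris dataset is very simple and is often called the "Hello World" of machine learning. It contains 4 features for three species of iris flower (Iris setosa, Iris virginica, and Iris versicolor): sepal length, sepal width, petal length, and petal width. The dataset has 150 data points, 50 for each species.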

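EDA on the Iris Dataset

First, let's load the dataset from the CSV file "iris_csv.csv" using pandas and get a general overview of it.

The dataset can be downloaded from the following link.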

https://datahub.io/machine-learning/iris/r/iris.csv
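
As a hedged sketch of what the loading step involves, the snippet below reads iris-style CSV text with pandas and checks that the expected columns are present. The column names follow the `df.info()` output shown later in this article; the helper name `load_iris_csv` and the inline sample rows are assumptions made so the example runs without the downloaded file.

```python
import io

import pandas as pd

# Column names as they appear in the df.info() output of this article.
EXPECTED_COLUMNS = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]


def load_iris_csv(source):
    """Read an iris-style CSV from a path or file-like object (hypothetical helper)."""
    frame = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return frame


# Inline stand-in for the downloaded file: a header plus a few rows of the dataset.
SAMPLE_CSV = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
"""

sample_df = load_iris_csv(io.StringIO(SAMPLE_CSV))
print(sample_df.shape)
print(sample_df.dtypes)
```

With the real file, the same helper could be called with the path to the saved CSV instead of the `StringIO` object.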

Code Implementation

Example 1

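import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Notebook-only magic (Jupyter/Colab); remove it when running as a plain script
%matplotlib inline

# Path as used in a hosted notebook such as Google Colab; adjust it to wherever the CSV was saved
df = pd.read_csv("/content/iris_csv.csv")
df.head()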

	sepallength	sepalwidth	petallength	petalwidth	class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

Example 2

df.info()

RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sepallength  150 non-null    float64
 1   sepalwidth   150 non-null    float64
 2   petallength  150 non-null    float64
 3   petalwidth   150 non-null    float64
 4   class        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

df.shape

(150, 5)


## Statistics about dataset
df.describe()

	sepallength	sepalwidth	petallength	petalwidth
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667
std	0.828066	0.433594	1.764420	0.763161
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000
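
To make the rows of this summary concrete, here is a small sketch that recomputes a few of them by hand for one column. It relies on pandas defaults: `std` is the sample standard deviation (`ddof=1`) and the 50% row is the median. The four values are sepal lengths taken from the `df.head()` output earlier in the article, used only as a tiny stand-in for the full column.

```python
import pandas as pd

# A few sepal lengths from the df.head() output, as a tiny stand-in column.
sepal_length = pd.Series([5.1, 4.9, 4.6, 5.0], name="sepallength")

summary = sepal_length.describe()

# The same numbers computed one at a time.
manual = {
    "count": float(sepal_length.count()),
    "mean": sepal_length.mean(),
    "std": sepal_length.std(ddof=1),     # sample standard deviation
    "min": sepal_length.min(),
    "50%": sepal_length.quantile(0.50),  # the median
    "max": sepal_length.max(),
}

for name, value in manual.items():
    print(f"{name:>5}: describe={summary[name]:.4f}  manual={value:.4f}")
```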

Example 3

## checking for null values

df.isnull().sum()

sepallength    0
sepalwidth     0
petallength    0
petalwidth     0
class          0
dtype: int64
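
The iris data happens to be complete, so the check above returns zeros. As a sketch of what the same call reports when values are missing, the example below blanks out two cells in a small stand-in frame. The frame and the choice to drop incomplete rows are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with two cells deliberately set to missing.
toy = pd.DataFrame(
    {
        "sepallength": [5.1, 4.9, np.nan, 5.0],
        "sepalwidth": [3.5, 3.0, 3.1, 3.6],
        "class": ["Iris-setosa", None, "Iris-setosa", "Iris-setosa"],
    }
)

# Per-column count of missing cells, the same call used on the full dataset.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One possible follow-up: keep only the rows that are fully populated.
complete_rows = toy.dropna()
print(len(complete_rows), "complete rows out of", len(toy))
```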

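## Univariate analysis: mean and median of each feature per class
df.groupby('class').agg(['mean', 'median'])  # passing a list of recognized strings
# Equivalent call with NumPy callables; recent pandas versions prefer the string names above
# df.groupby('class').agg([np.mean, np.median])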

	sepallength		sepalwidth		petallength		petalwidth
	mean	median	mean	median	mean	median	mean	median
class
Iris-setosa	5.006	5.0	3.418	3.4	1.464	1.50	0.244	0.2
Iris-versicolor	5.936	5.9	2.770	2.8	4.260	4.35	1.326	1.3
Iris-virginica	6.588	6.5	2.974	3.0	5.552	5.55	2.026	2.0
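
The grouped summary can also be written with pandas named aggregation, which gives flat, readable column names instead of a two-level header. This is a sketch on a six-row stand-in sample in the iris format; the output column names `mean_petal_length` and `median_petal_length` are made up for the example.

```python
import pandas as pd

# Six-row stand-in sample: two rows per species.
sample = pd.DataFrame(
    {
        "petallength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
        "class": [
            "Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor",
            "Iris-virginica", "Iris-virginica",
        ],
    }
)

# Named aggregation: new_column=(source_column, function_name)
per_class = sample.groupby("class").agg(
    mean_petal_length=("petallength", "mean"),
    median_petal_length=("petallength", "median"),
)
print(per_class)
```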

Example 4

## Box plot of sepal width for each species
plt.figure(figsize=(8, 4))
sns.boxplot(x='class', y='sepalwidth', data=df, palette='YlGnBu')

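Example 5

## Distribution of a single feature (petal width)
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is used instead
sns.histplot(data=df, x='petalwidth', bins=40, color='b', kde=True)
plt.title('petal width distribution plot')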

Example 6

## Count of the number of observations of each species

sns.countplot(x='class', data=df)

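Example 7

## Correlation map using a heatmap matrix

# Drop the non-numeric 'class' column first; newer pandas versions raise an error otherwise
sns.heatmap(df.drop(columns='class').corr(), linecolor='white', linewidths=1)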
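
`DataFrame.corr` works on numeric columns, so the text `class` column has to be left out before the heatmap is drawn. The sketch below shows that selection step and the resulting Pearson matrix on a six-row stand-in sample in the iris format. The sample is only there to keep the example self-contained, and its coefficients will not match those of the full 150-row dataset.

```python
import pandas as pd

# Six-row stand-in sample: two rows per species.
sample = pd.DataFrame(
    {
        "sepallength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
        "sepalwidth": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
        "petallength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
        "petalwidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
        "class": [
            "Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor",
            "Iris-virginica", "Iris-virginica",
        ],
    }
)

# Keep only the numeric feature columns, then compute pairwise Pearson correlation.
numeric = sample.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```

The resulting `corr` frame is what would be passed to `sns.heatmap`; adding `annot=True` there prints each coefficient inside its cell.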

Example 8

## Multivariate analysis: analysis between two or more variables or features
## Scatter plot to see the relation between two features, such as sepal length and sepal width
axis = plt.axes()

axis.scatter(df.sepallength, df.sepalwidth)

axis.set(xlabel='Sepal_Length (cm)',
   ylabel='Sepal_Width (cm)',
   title='Sepal-Length vs Width');

Example 9

sns.scatterplot(x='sepallength', y='sepalwidth', hue='class', data=df)
plt.show()

Example 10

## From the above graph we can see that
# Iris-virginica tends to have longer sepals, while Iris-setosa tends to have wider sepals
# Iris-setosa forms a cluster that is clearly separated from the other two species
## Below is the frequency histogram of all features
axis = df.plot.hist(bins=30, alpha=0.5)
axis.set_xlabel('Size in cm');

Example 11

# In the histogram above, sepalwidth appears to have the tallest bars (highest frequency), followed by petalwidth
## Examining pairwise relationships between features
sns.pairplot(df, hue='class')

Example 12

figure, ax = plt.subplots(2, 2, figsize=(8,8))

ax[0,0].set_title("sepallength")
ax[0,0].hist(df['sepallength'], bins=8)

ax[0,1].set_title("sepalwidth")
ax[0,1].hist(df['sepalwidth'], bins=6);

ax[1,0].set_title("petallength")
ax[1,0].hist(df['petallength'], bins=5);

ax[1,1].set_title("petalwidth")
ax[1,1].hist(df['petalwidth'], bins=5);

Example 13

# From the above plots we can see that:
# - Sepal length: the most frequent range is 5.5 cm to 6 cm, with about 30-35 observations
# - Petal length: the most frequent range is 1 cm to 2 cm, with about 50 observations
# - Sepal width: the most frequent range is 3 cm to 3.5 cm, with about 70 observations
# - Petal width: the most frequent range is 0 cm to 0.5 cm, with about 40-45 observations
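
The bar heights in these histograms are counts of observations, while the horizontal axis carries the measurement in centimetres. A quick way to read the counts off numerically is `numpy.histogram`, which returns the count and the edges of each equal-width bin. The sketch below uses a handful of petal lengths as a stand-in column, so its counts are not those of the full dataset.

```python
import numpy as np

# A few petal lengths as a stand-in column (values in cm).
petal_length = np.array([1.4, 1.4, 1.5, 1.4, 4.7, 4.5, 6.0, 5.1])

# Five equal-width bins between the smallest and largest value.
counts, edges = np.histogram(petal_length, bins=5)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:.2f} cm to {right:.2f} cm: {count} observations")

# The tallest bar is the bin with the largest count.
tallest = int(np.argmax(counts))
print("most frequent range:", round(edges[tallest], 2), "to", round(edges[tallest + 1], 2))
```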

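Conclusion

Exploratory data analysis is widely used by data scientists and analysts. It reveals many characteristics of a given dataset, how its values are distributed, and how the data can be useful.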

Mithilesh Pradhan

Updated on: 2022-12-30
