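Customer Churn Prediction in Python
Every business depends on customer loyalty. Repeat business from existing customers is one of the foundations of profitability, so it is important to understand why customers leave. The loss of customers is known as customer churn, and the proportion who leave over a given period is the churn rate. By studying past trends, we can identify which factors influence churn and predict whether a particular customer is likely to leave. In this article, we use a machine learning algorithm to study past churn patterns and then identify which customers are likely to churn.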
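Data Preparation
As an example, this article considers churn among telecom customers. The source data is available on Kaggle; the download URL is given in a comment in the program below. We use the pandas library to load the CSV file into Python and view some sample rows.
Example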
import pandas as pd

# Load the Telco-Customer-Churn.csv dataset
# https://www.kaggle.com/blastchar/telco-customer-churn
# A raw string keeps the backslash in the Windows path literal.
datainput = pd.read_csv(r'E:\Telecom_customers.csv')
print("Given input data :\n", datainput)
Output
Running the above code gives the following result:
Given input data :
      customerID  gender  SeniorCitizen  ...  MonthlyCharges  TotalCharges  Churn
0     7590-VHVEG  Female              0  ...           29.85         29.85     No
1     5575-GNVDE    Male              0  ...           56.95        1889.5     No
2     3668-QPYBK    Male              0  ...           53.85        108.15    Yes
3     7795-CFOCW    Male              0  ...           42.30       1840.75     No
4     9237-HQITU  Female              0  ...           70.70        151.65    Yes
...          ...     ...            ...  ...             ...           ...    ...
7038  6840-RESVB    Male              0  ...           84.80        1990.5     No
7039  2234-XADUH  Female              0  ...          103.20        7362.9     No
7040  4801-JZAZL  Female              0  ...           29.60        346.45     No
7041  8361-LTMKD    Male              1  ...           74.40         306.6    Yes
7042  3186-AJIEK    Male              0  ...          105.65        6844.5     No

[7043 rows x 21 columns]
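Studying Existing Patterns
Next, we study the dataset to find existing patterns in where churn occurs. We also remove columns that are not useful for the prediction. For example, the customerID column has no bearing on whether a customer leaves, so we remove it with the drop method; the code also removes the TotalCharges column with pop. We then plot a pie chart showing the percentage of churn in the dataset.
Example 2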
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Load the Telco-Customer-Churn.csv dataset
# https://www.kaggle.com/blastchar/telco-customer-churn
datainput = pd.read_csv(r'E:\Telecom_customers.csv')
print("Given input data :\n", datainput)

# Drop columns that are not used as features
datainput.drop(['customerID'], axis=1, inplace=True)
datainput.pop('TotalCharges')

# Count the customers in each Churn class
data = datainput['Churn'].value_counts(sort=True)

# Pie chart settings: colors, figure size, and slice offsets
chroma = ["#BDFCC9", "#FFDEAD"]
rcParams['figure.figsize'] = 9, 9
explode = [0.2, 0.2]

plt.pie(data, explode=explode, colors=chroma, autopct='%1.1f%%',
        shadow=True, startangle=180)
plt.title('Percentage of Churn in the given Data')
plt.show()
Output
Running the above code displays a pie chart of the churn percentage in the dataset.
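The pie chart shows the split visually; the same percentages can also be read off numerically. The following is a minimal sketch that assumes a small, made-up Series standing in for the Churn column (the values are hypothetical and are not taken from the dataset). It relies on the normalize parameter of value_counts, which returns fractions instead of raw counts.

```python
import pandas as pd

# Hypothetical stand-in for datainput['Churn']; values are made up.
churn = pd.Series(["No", "No", "Yes", "No", "Yes", "No", "No", "No"], name="Churn")

# normalize=True returns the fraction of rows in each class.
churn_share = churn.value_counts(normalize=True) * 100
print(churn_share.round(1))
```

With the real data, replacing the made-up Series with datainput['Churn'] should give the same percentages that the pie chart labels show.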
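Data Preprocessing
To make the data usable by machine learning algorithms, we label-encode the categorical fields, converting their text values into numeric labels. For example, the values in the gender column become 0 and 1 instead of Female and Male. This lets the algorithm use these fields in the calculations that estimate their effect on the churn value. We use the LabelEncoder class from sklearn.
Example 3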
import pandas as pd
from sklearn import preprocessing

# Load the dataset (same file as in the earlier examples)
datainput = pd.read_csv(r'E:\Telecom_customers.csv')

label_encoder = preprocessing.LabelEncoder()

# Text columns to convert into integer labels
text_columns = [
    'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
    'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
    'PaperlessBilling', 'PaymentMethod', 'Churn',
]
for column in text_columns:
    datainput[column] = label_encoder.fit_transform(datainput[column])
print("input data after label encoder :\n", datainput)

# Separate the features (X) and the label (y)
datainput["Churn"] = datainput["Churn"].astype(int)
y = datainput["Churn"].values
X = datainput.drop(labels=["Churn"], axis=1)
print("\nseparated X and y :")
print("y -", y)
print("X -", X)
Output
Running the above code gives the following result:
input data after label encoder :
      customerID  gender  SeniorCitizen  ...  MonthlyCharges  TotalCharges  Churn
0     7590-VHVEG       0              0  ...           29.85         29.85      0
1     5575-GNVDE       1              0  ...           56.95        1889.5      0
2     3668-QPYBK       1              0  ...           53.85        108.15      1
3     7795-CFOCW       1              0  ...           42.30       1840.75      0
4     9237-HQITU       0              0  ...           70.70        151.65      1
...          ...     ...            ...  ...             ...           ...    ...
7038  6840-RESVB       1              0  ...           84.80        1990.5      0
7039  2234-XADUH       0              0  ...          103.20        7362.9      0
7040  4801-JZAZL       0              0  ...           29.60        346.45      0
7041  8361-LTMKD       1              1  ...           74.40         306.6      1
7042  3186-AJIEK       1              0  ...          105.65        6844.5      0

[7043 rows x 21 columns]

separated X and y :
y - [0 0 1 ... 0 1 0]
X -       customerID  gender  ...  MonthlyCharges  TotalCharges
0     7590-VHVEG       0  ...           29.85         29.85
1     5575-GNVDE       1  ...           56.95        1889.5
2     3668-QPYBK       1  ...           53.85        108.15
3     7795-CFOCW       1  ...           42.30       1840.75
4     9237-HQITU       0  ...           70.70        151.65
...          ...     ...  ...             ...           ...
7038  6840-RESVB       1  ...           84.80        1990.5
7039  2234-XADUH       0  ...          103.20        7362.9
7040  4801-JZAZL       0  ...           29.60        346.45
7041  8361-LTMKD       1  ...           74.40         306.6
7042  3186-AJIEK       1  ...          105.65        6844.5

[7043 rows x 20 columns]
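To see which integer each text value receives, the fitted encoder can be inspected through its classes_ attribute. The sketch below uses a short hypothetical list of gender values rather than the real column. LabelEncoder assigns codes in sorted order of the labels, which is consistent with Female appearing as 0 and Male as 1 in the output above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values; the real column has one entry per customer.
genders = ["Female", "Male", "Male", "Female"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(genders)

# classes_ holds the original labels; a label's position is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

For columns with more than two categories, such as PaymentMethod, the integer codes imply an ordering that may not mean anything, so the resulting weights for those columns are best read with some caution.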
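Training and Testing Data
We now split the dataset into two parts: one for training and one for testing. The test_size parameter sets the fraction of the dataset that is held out for testing only; a value of 0.2 reserves 20% of the rows. Checking the model against data it has not seen gives us more confidence in the model we are building. We then fit a logistic regression model and obtain the predicted values.
Example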
import pandas as pd
import warnings
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

# Load the Telco-Customer-Churn.csv dataset with pandas
datainput = pd.read_csv(r'E:\Telecom_customers.csv')
datainput.drop(['customerID'], axis=1, inplace=True)
datainput.pop('TotalCharges')

# Encode the text columns as integer labels
label_encoder = preprocessing.LabelEncoder()
text_columns = [
    'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
    'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
    'PaperlessBilling', 'PaymentMethod', 'Churn',
]
for column in text_columns:
    datainput[column] = label_encoder.fit_transform(datainput[column])

# Separate the features (X) and the label (Y)
datainput["Churn"] = datainput["Churn"].astype(int)
Y = datainput["Churn"].values
X = datainput.drop(labels=["Churn"], axis=1)

# Split into training and test sets (20% held out for testing)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Fit logistic regression and predict on the test set
classifier = LogisticRegression()
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)
print("\npredicted values :\n", Y_pred)
Output
Running the above code gives the following result:
predicted values :
 [0 0 1 ... 0 1 0]
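Because the split is random, the predicted values and any scores computed from them can differ from run to run. One possible way to make the split repeatable, and to keep the churn proportion the same in both parts, is to pass the random_state and stratify parameters of train_test_split. The toy arrays and the seed 42 below are assumptions for illustration only, not part of the original program.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, 30% positive labels.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 30 + [0] * 70)

# random_state fixes the shuffle; stratify keeps the class ratio
# of y the same in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Both parts keep the 30% positive rate of the full toy set, which matters when the positive class is the minority, as churners are in this dataset.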
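Evaluation Metrics
With the predictions in hand, we evaluate the model further using two measures: the accuracy score and the confusion matrix. A higher accuracy indicates that the model fits the test data better. The confusion matrix tabulates the true positives, true negatives, false positives, and false negatives. The larger the share of correct (true) predictions relative to incorrect (false) ones, the better the model.
Example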
import pandas as pd
import warnings
from sklearn import metrics, preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

# Load the Telco-Customer-Churn.csv dataset with pandas
datainput = pd.read_csv(r'E:\Telecom_customers.csv')
datainput.drop(['customerID'], axis=1, inplace=True)
datainput.pop('TotalCharges')

# Encode the text columns as integer labels
label_encoder = preprocessing.LabelEncoder()
text_columns = [
    'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
    'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
    'PaperlessBilling', 'PaymentMethod', 'Churn',
]
for column in text_columns:
    datainput[column] = label_encoder.fit_transform(datainput[column])

# Separate the features (X) and the label (Y)
datainput["Churn"] = datainput["Churn"].astype(int)
Y = datainput["Churn"].values
X = datainput.drop(labels=["Churn"], axis=1)

# Split into training and test sets (20% held out for testing)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Fit logistic regression and predict on the test set
classifier = LogisticRegression()
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)

# Accuracy as a percentage
LR = metrics.accuracy_score(Y_test, Y_pred) * 100
print("\nThe accuracy score using the LR is -> ", LR)

# Confusion matrix: rows are actual classes, columns are predicted classes
cm = confusion_matrix(Y_test, Y_pred)
print("\nconfusion matrix : \n", cm)
Output
Running the above code gives the following result:
The accuracy score using the LR is ->  80.8374733853797

confusion matrix : 
 [[928 109]
 [161 211]]
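The accuracy figure can be recomputed by hand from the confusion matrix, and the same four counts also yield precision and recall for the churn class. The sketch below takes the matrix printed above as given and relies on scikit-learn's layout for binary labels, [[TN, FP], [FN, TP]], with 0 meaning the customer stayed and 1 meaning the customer churned. Precision and recall are additions for illustration; the original program reports only accuracy.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 928, 109   # actual 0: predicted 0, predicted 1
fn, tp = 161, 211   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp

# Share of all test customers classified correctly.
accuracy = (tp + tn) / total
# Of the customers predicted to churn, the share that really churned.
precision = tp / (tp + fp)
# Of the customers that really churned, the share the model caught.
recall = tp / (tp + fn)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

The accuracy matches the reported score of about 80.8%, while recall comes out near 57%, meaning the model misses a sizeable share of the customers who actually churned. That gap is why accuracy alone can look better than the model is at the task of spotting churners.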
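Variable Weights
Next, we determine how each field or variable influences the churn value. This helps us identify the variables with the greatest effect on churn and act on them to prevent customers from leaving. To do this, we read the fitted coefficients from the classifier (row zero of its coef_ attribute) and pair each one with its column name to obtain the weight of each variable.
Example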
import pandas as pd
import warnings
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

# Load the dataset with pandas
datainput = pd.read_csv(r'E:\Telecom_customers.csv')
datainput.drop(['customerID'], axis=1, inplace=True)
datainput.pop('TotalCharges')

# Encode the text columns as integer labels
label_encoder = preprocessing.LabelEncoder()
text_columns = [
    'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
    'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
    'PaperlessBilling', 'PaymentMethod', 'Churn',
]
for column in text_columns:
    datainput[column] = label_encoder.fit_transform(datainput[column])

# Separate the features (X) and the label (Y)
datainput["Churn"] = datainput["Churn"].astype(int)
Y = datainput["Churn"].values
X = datainput.drop(labels=["Churn"], axis=1)

# Split into training and test sets (20% held out for testing)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Fit logistic regression and predict on the test set
classifier = LogisticRegression()
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)

# Weights of all the variables: coef_[0] holds one coefficient per column
wt = pd.Series(classifier.coef_[0], index=X.columns.values)
print("\nweight of all the variables :")
print(wt.sort_values(ascending=False))
Output
Running the above code gives the following result:
weight of all the variables :
PaperlessBilling    0.389379
SeniorCitizen       0.246504
InternetService     0.209283
Partner             0.067855
StreamingMovies     0.054309
MultipleLines       0.042330
PaymentMethod       0.039134
MonthlyCharges      0.027180
StreamingTV        -0.008606
gender             -0.029547
tenure             -0.034668
DeviceProtection   -0.052690
OnlineBackup       -0.143625
Dependents         -0.209667
OnlineSecurity     -0.245952
TechSupport        -0.254740
Contract           -0.729557
PhoneService       -0.950555
dtype: float64
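A logistic regression weight is a change in log-odds, so exponentiating it gives an odds ratio, which is often easier to read: values above 1 raise the odds of churn and values below 1 lower them. The sketch below applies this to three of the weights printed above. It is an interpretation aid under the assumption that those weights are taken as given, and it is not part of the original program.

```python
import math

# Three weights copied from the output above
# (log-odds change per one-unit increase in the encoded variable).
weights = {
    "PaperlessBilling": 0.389379,
    "Contract": -0.729557,
    "PhoneService": -0.950555,
}

# exp(weight) converts a log-odds change into an odds ratio.
odds_ratios = {name: math.exp(w) for name, w in weights.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Read this way, a one-step increase in the encoded Contract value roughly halves the odds of churn, holding the other variables fixed. Because the variables are on different scales (tenure and MonthlyCharges span far wider ranges than the 0/1 fields), the raw magnitudes are only a rough guide to relative importance.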