Big Data Analytics - Text Analytics

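This chapter uses the data scraped in Part 1 of the book. The data contains text describing freelancer profiles together with the hourly rate each freelancer charges in USD. The goal of the following section is to fit a model that predicts a freelancer's hourly rate from their skills.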

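The following code shows how to convert the raw text, which in this case contains each user's skills, into a bag-of-words matrix. For this we use an R library called tm. In a bag-of-words representation, every word in the corpus becomes a variable whose value records how often that word occurs in each document. Note that the printed summary below reports tf-idf weighting, so the entries here are weighted frequencies rather than raw counts.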

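library(tm)
library(data.table)

# bag_words() and as_sparseMatrix() are not part of tm or data.table;
# they are presumably defined in this helper file
source('text_analytics/text_analytics_functions.R')
data = fread('text_analytics/data/profiles.txt')

# Convert the rate to numeric and keep only profiles with a valid rate
rate = as.numeric(data$rate)
keep = !is.na(rate)
rate = rate[keep]

### Make a bag of words from the user skills
X_all = bag_words(data$user_skills[keep])
# Drop terms that are absent from more than 99.9% of the documents
X_all = removeSparseTerms(X_all, 0.999)
X_all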

# <<DocumentTermMatrix (documents: 389, terms: 1422)>> 
#   Non-/sparse entries: 4057/549101 
# Sparsity           : 99% 
# Maximal term length: 80 
# Weighting          : term frequency - inverse document frequency (normalized) (tf-idf) 

### Make a sparse matrix with all the data 
X_all <- as_sparseMatrix(X_all)
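
The helpers bag_words() and as_sparseMatrix() come from the sourced file and are not shown in this chapter. The following is only a minimal sketch of what such helpers might look like, assuming lower-casing, punctuation removal and normalized tf-idf weighting (the latter is suggested by the printed summary above). The names bag_words_sketch and as_sparse_sketch and the sample skills are hypothetical and are not the book's code.

```r
library(tm)
library(Matrix)

# Hypothetical stand-in for bag_words(): builds a tf-idf weighted
# document-term matrix from a character vector
bag_words_sketch <- function(text) {
  corpus <- VCorpus(VectorSource(text))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  DocumentTermMatrix(
    corpus,
    control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE))
  )
}

# Hypothetical stand-in for as_sparseMatrix(): a DocumentTermMatrix is stored
# in triplet form (i, j, v), which maps directly onto Matrix::sparseMatrix
as_sparse_sketch <- function(dtm) {
  sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
               dims = c(dtm$nrow, dtm$ncol),
               dimnames = dtm$dimnames)
}

# Made-up skill strings, for illustration only
skills <- c("R, Python, machine learning",
            "Python, SQL",
            "graphic design, Photoshop")
dtm <- bag_words_sketch(skills)
X <- as_sparse_sketch(dtm)
dim(X)
```

By default tm drops terms shorter than three characters, so a one-letter skill such as "R" would not get its own column unless the wordLengths control option is changed.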

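Now that the text is represented as a sparse matrix, we can fit a model that yields a sparse solution. A good choice for this case is the LASSO (least absolute shrinkage and selection operator), a regression model that selects the most relevant features for predicting the target by shrinking the coefficients of the remaining features to exactly zero.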
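
For reference, the objective that glmnet minimizes for family = 'gaussian' with alpha = 1 can be written as follows, where N is the number of training profiles, x_i is the row of skill features for profile i, y_i is its hourly rate, and lambda is the penalty strength chosen by cross-validation. The notation is ours and is not taken from the book.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The absolute-value penalty is what drives many coefficients to exactly zero, which is why the fitted model uses only a subset of the terms.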

# Use the first 200 profiles for training and the remaining ones for testing
train_inx = 1:200
X_train = X_all[train_inx, ]
y_train = rate[train_inx]
X_test = X_all[-train_inx, ]
y_test = rate[-train_inx]

# Train a regression model 
library(glmnet) 
fit <- cv.glmnet(x = X_train, y = y_train,  
   family = 'gaussian', alpha = 1,  
   nfolds = 3, type.measure = 'mae') 
plot(fit)  

# Make predictions (predict() on a cv.glmnet fit uses s = "lambda.1se" by default)
predictions = predict(fit, newx = X_test)
predictions = as.vector(predictions[,1])
head(predictions)

# 36.23598 36.43046 51.69786 26.06811 35.13185 37.66367

# We can compute the mean absolute error for the test data
mean(abs(y_test - predictions))
# 15.02175
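
To see which features the LASSO actually keeps, one can inspect the non-zero coefficients of the cross-validated fit. The sketch below is self-contained and runs on simulated data, because the profiles file is not available here; the skill_* column names and the simulated rates are made up for illustration.

```r
library(glmnet)
library(Matrix)

set.seed(42)

# Synthetic stand-in for the skills matrix: 200 profiles, 50 binary skill columns
n <- 200
p <- 50
X <- Matrix(rbinom(n * p, 1, 0.1), nrow = n, ncol = p, sparse = TRUE)
colnames(X) <- paste0("skill_", seq_len(p))

# Only the first three skills influence the simulated hourly rate
y <- 30 + 20 * X[, 1] + 10 * X[, 2] - 5 * X[, 3] + rnorm(n, sd = 2)

fit <- cv.glmnet(x = X, y = y,
                 family = "gaussian", alpha = 1,
                 nfolds = 3, type.measure = "mae")

# Coefficients at the lambda with the lowest cross-validated error
beta <- as.matrix(coef(fit, s = "lambda.min"))[, 1]

# Keep the non-zero coefficients, excluding the intercept
selected <- beta[beta != 0 & names(beta) != "(Intercept)"]
selected[order(abs(selected), decreasing = TRUE)]
```

Applied to the model in this chapter, the same coef() call on fit would list the skill terms that carry a non-zero weight in the rate prediction.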

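We now have a model that, given a set of skills, predicts a freelancer's hourly rate. On the held-out profiles its mean absolute error is about 15 USD per hour. Collecting more data would likely improve the model's performance, and the code implementing this pipeline would stay the same.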
