Gensim - 建立LDA主題模型

本章將幫助您學習如何在 Gensim 中建立潛在狄利克雷分配 (LDA) 主題模型。

從大量文字中自動提取有關主題的資訊是自然語言處理 (NLP) 的主要應用之一。大量文字可能是來自酒店評論、推文、Facebook帖子、任何其他社交媒體渠道的帖子、電影評論、新聞報道、使用者反饋、電子郵件等的饋送。

在這個數字時代，瞭解人們/客戶在談論什麼，理解他們的觀點和問題，對於企業、政治活動和管理人員來說可能非常有價值。但是，是否可以手動閱讀如此大量的文字，然後從中提取主題資訊呢？

不，這不可能。它需要一個能夠自動讀取這些大量文字文件並自動從中提取所需資訊/討論主題的演算法。

LDA 的作用

LDA 對主題建模的方法是將文件中的文字分類到特定主題。LDA 建立了作為狄利克雷分佈建模的模型：

每個文件的主題模型，以及
每個主題的詞語模型

在提供 LDA 主題模型演算法後，為了獲得良好的主題-關鍵詞分佈組合，它會重新排列：

文件中的主題分佈，以及
主題中的關鍵詞分佈

在處理過程中，LDA 做出的一些假設包括：

每個文件都被建模為主題的多項式分佈。
每個主題都被建模為詞語的多項式分佈。
我們必須選擇正確的語料庫資料，因為 LDA 假設每個文字塊都包含相關的詞語。
LDA 還假設文件是由主題混合產生的。

使用 Gensim 實現

在這裡，我們將使用 LDA（潛在狄利克雷分配）從資料集中提取自然討論的主題。

載入資料集

我們將要使用的資料集是“20 個新聞組”的資料集，其中包含來自新聞報道各個部分的數千篇新聞文章。它在 Sklearn 資料集中可用。我們可以輕鬆地使用以下 Python 指令碼下載：

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

讓我們使用以下指令碼檢視一些示例新聞：

newsgroups_train.data[:4]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: 
WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: 
University of Maryland, College Park\nLines: 
15\n\n I was wondering if anyone out there could enlighten me on this car 
I saw\nthe other day. It was a 2-door sports car, looked to be from the 
late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. 
In addition,\nthe front bumper was separate from the rest of the body. 
This is \nall I know. If anyone can tellme a model name, 
engine specs, years\nof production, where this car is made, history, or 
whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,
\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",

"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final 
Call\nSummary: Final call for SI clock reports\nKeywords: 
SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: 
University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA 
fair number of brave souls who upgraded their SI clock oscillator have\nshared their 
experiences for this poll. Please send a brief message detailing\nyour experiences with 
the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat 
sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies 
are especially requested.\n\nI will be summarizing in the next two days, so please add 
to the network\nknowledge base if you have done the clock upgrade and haven't answered 
this\npoll. Thanks.\n\nGuy Kuo <;guykuo@u.washington.edu>\n",

'From: twillis@ec.ecn.purdue.edu (Thomas E Willis)\nSubject: 
PB questions...\nOrganization: Purdue University Engineering 
Computer Network\nDistribution: usa\nLines: 36\n\nwell folks, 
my mac plus finally gave up the ghost this weekend after\nstarting 
life as a 512k way back in 1985. sooo, i\'m in the market for 
a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking 
into picking up a powerbook 160 or maybe 180 and have a bunch\nof 
questions that (hopefully) somebody can answer:\n\n* does anybody 
know any dirt on when the next round of powerbook\nintroductions 
are expected? i\'d heard the 185c was supposed to make an\nappearence 
"this summer" but haven\'t heard anymore on it - and since i\ndon\'t 
have access to macleak, i was wondering if anybody out there had\nmore 
info...\n\n* has anybody heard rumors about price drops to the powerbook 
line like the\nones the duo\'s just went through recently?\n\n* what\'s 
the impression of the display on the 180? i could probably swing\na 180 
if i got the 80Mb disk rather than the 120, but i don\'t really have\na 
feel for how much "better" the display is (yea, it looks great in the\nstore, 
but is that all "wow" or is it really that good?). could i solicit\nsome 
opinions of people who use the 160 and 180 day-to-day on if its
worth\ntaking the disk size and money hit to get the active display? 
(i realize\nthis is a real subjective question, but i\'ve only played around 
with the\nmachines in a computer store breifly and figured the opinions 
of somebody\nwho actually uses the machine daily might prove helpful).\n\n* 
how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info - 
if you could email, i\'ll post a\nsummary (news reading time is at a premium 
with finals just around the\ncorner... :
( )\n--\nTom Willis \\ twillis@ecn.purdue.edu \\ Purdue Electrical 
Engineering\n---------------------------------------------------------------------------\
n"Convictions are more dangerous enemies of truth than lies." - F. W.\nNietzsche\n',

'From: jgreen@amber (Joe Green)\nSubject: Re: Weitek P9000 ?\nOrganization: 
Harris Computer Systems Division\nLines: 14\nDistribution: world\nNNTP-Posting-Host: 
amber.ssd.csd.harris.com\nX-Newsreader: TIN [version 1.1 PL9]\n\nRobert 
J.C. Kyanko (rob@rjck.UUCP) wrote:\n >abraxis@iastate.edu writes in article 
<abraxis.734340159@class1.iastate.edu >:\n> > Anyone know about the 
Weitek P9000 graphics chip?\n > As far as the low-level stuff goes, it looks 
pretty nice. It\'s got this\n> quadrilateral fill command that requires just 
the four points.\n\nDo you have Weitek\'s address/phone number? I\'d like to get 
some information\nabout this chip.\n\n--\nJoe Green\t\t\t\tHarris 
Corporation\njgreen@csd.harris.com\t\t\tComputer Systems Division\n"The only 
thing that really scares me is a person with no sense of humor.
"\n\t\t\t\t\t\t-- Jonathan Winters\n']

先決條件

我們需要來自 NLTK 的停用詞和來自 Scapy 的英語模型。兩者都可以按如下方式下載：

import nltk;
nltk.download('stopwords')
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

匯入必要的包

為了構建 LDA 模型，我們需要匯入以下必要的包：

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt

準備停用詞

現在，我們需要匯入停用詞並使用它們：

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

清理文字

現在，藉助 Gensim 的 simple_preprocess()，我們需要將每個句子標記化為詞語列表。我們還應該刪除標點符號和不必要的字元。為此，我們將建立一個名為 sent_to_words() 的函式：

def sent_to_words(sentences):
   for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))

構建二元語法和三元語法模型

眾所周知，二元語法是文件中經常一起出現的兩個詞語，三元語法是文件中經常一起出現的三個詞語。藉助 Gensim 的 Phrases 模型，我們可以做到這一點：

bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

過濾掉停用詞

接下來，我們需要過濾掉停用詞。除此之外，我們還將建立函式來建立二元語法、三元語法和進行詞形還原：

def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc))
if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
     doc = nlp(" ".join(sent))
     texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out

為主題模型構建詞典和語料庫

現在我們需要構建詞典和語料庫。我們之前在示例中也做過：

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]

構建 LDA 主題模型

我們已經實現了訓練 LDA 模型所需的一切。現在是構建 LDA 主題模型的時候了。對於我們的實現示例，可以使用以下幾行程式碼完成：

lda_model = gensim.models.ldamodel.LdaModel(
   corpus=corpus, id2word=id2word, num_topics=20, random_state=100, 
   update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

實現示例

讓我們看看構建 LDA 主題模型的完整實現示例：

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub('\s+', ' ', sent) for sent in data]
data = [re.sub("\'", "", sent) for sent in data]
print(data_words[:4]) #it will print the data after prepared for stopwords
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=[
   'NOUN', 'ADJ', 'VERB', 'ADV'
])
print(data_lemmatized[:4]) #it will print the lemmatized data.
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4]) #it will print the corpus we created above.
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:4]] 
#it will print the words with their frequencies.
lda_model = gensim.models.ldamodel.LdaModel(
   corpus=corpus, id2word=id2word, num_topics=20, random_state=100, 
   update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

我們現在可以使用上面建立的 LDA 模型來獲取主題，計算模型困惑度。

列印頁面