Gensim - 建立 LSI 和 HDP 主題模型

本章討論使用 Gensim 建立潛在語義索引 (LSI) 和分層狄利克雷過程 (HDP) 主題模型。

Gensim 中最早實現的主題建模演算法是潛在語義索引 (LSI)，也稱為潛在語義分析 (LSA)。它由 Scott Deerwester、Susan Dumais、George Furnas、Richard Harshman、Thomas Landaur、Karen Lochbaum 和 Lynn Streeter 於 1988 年獲得專利。

在本節中，我們將設定我們的 LSI 模型。這可以透過與設定 LDA 模型相同的方式完成。我們需要從gensim.models匯入 LSI 模型。

LSI 的作用

實際上，LSI 是一種 NLP 技術，尤其是在分散式語義中。它分析一組文件與其包含的術語之間的關係。如果我們談論它的工作原理，那麼它會從一大段文字中構建一個包含每個文件的單詞計數的矩陣。

構建完成後，為了減少行數，LSI 模型使用一種稱為奇異值分解 (SVD) 的數學技術。除了減少行數外，它還保留了列之間的相似性結構。

在矩陣中，行表示唯一的單詞，列表示每個文件。它基於分散式假設工作，即它假設含義相近的單詞會出現在相同型別的文字中。

使用 Gensim 實現

在這裡，我們將使用 LSI（潛在語義索引）從資料集中提取自然討論的主題。

載入資料集

我們將要使用的資料集是“20 個新聞組”的資料集，其中包含來自新聞報道各個部分的數千篇新聞文章。它在Sklearn資料集下可用。我們可以使用以下 Python 指令碼輕鬆下載：

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

讓我們使用以下指令碼檢視一些示例新聞：

newsgroups_train.data[:4]
["From: lerxst@wam.umd.edu (where's my thing)\nSubject: 
WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: 
University of Maryland, College Park\nLines: 15\n\n 
I was wondering if anyone out there could enlighten me on this car 
I saw\nthe other day. It was a 2-door sports car,
looked to be from the late 60s/\nearly 70s. It was called a Bricklin. 
The doors were really small. In addition,\nthe front bumper was separate from 
the rest of the body. This is \nall I know. If anyone can tellme a model name, 
engine specs, years\nof production, where this car is made, history, or 
whatever info you\nhave on this funky looking car, 
please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood 
Lerxst ----\n\n\n\n\n",

"From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: 
SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: 
SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: 
University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA 
fair number of brave souls who upgraded their SI clock oscillator have\nshared their 
experiences for this poll. Please send a brief message detailing\nyour experiences with 
the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat 
sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies 
are especially requested.\n\nI will be summarizing in the next two days, so please add 
to the network\nknowledge base if you have done the clock upgrade and haven't answered 
this\npoll. Thanks.\n\nGuy Kuo <guykuo@u.washington.edu>\n",

'From: twillis@ec.ecn.purdue.edu (Thomas E Willis)\nSubject: 
PB questions...\nOrganization: Purdue University Engineering Computer 
Network\nDistribution: usa\nLines: 36\n\nwell folks, my mac plus finally gave up the 
ghost this weekend after\nstarting life as a 512k way back in 1985. sooo, i\'m in the 
market for a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking into 
picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) 
somebody can answer:\n\n* does anybody know any dirt on when the next round of 
powerbook\nintroductions are expected? i\'d heard the 185c was supposed to make 
an\nappearence "this summer" but haven\'t heard anymore on it - and since i\ndon\'t 
have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has 
anybody heard rumors about price drops to the powerbook line like the\nones the duo\'s 
just went through recently?\n\n* what\'s the impression of the display on the 180? i 
could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don\'t 
really have\na feel for how much "better" the display is (yea, it looks great in 
the\nstore, but is that all "wow" or is it really that good?). could i solicit\nsome 
opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk 
size and money hit to get the active display? (i realize\nthis is a real subjective 
question, but i\'ve only played around with the\nmachines in a computer store breifly 
and figured the opinions of somebody\nwho actually uses the machine daily might prove 
helpful).\n\n* how well does hellcats perform? ;)\n\nthanks a bunch in advance for any 
info - if you could email, i\'ll post a\nsummary (news reading time is at a premium 
with finals just around the\ncorner... :( )\n--\nTom Willis \\ twillis@ecn.purdue.edu 
\\ Purdue Electrical 
Engineering\n---------------------------------------------------------------------------\
n"Convictions are more dangerous enemies of truth than lies." - F. W.\nNietzsche\n',

'From: jgreen@amber (Joe Green)\nSubject: Re: Weitek P9000 ?\nOrganization: Harris 
Computer Systems Division\nLines: 14\nDistribution: world\nNNTP-Posting-Host: 
amber.ssd.csd.harris.com\nX-Newsreader: TIN [version 1.1 PL9]\n\nRobert J.C. Kyanko 
(rob@rjck.UUCP) wrote:\n > abraxis@iastate.edu writes in article <
abraxis.734340159@class1.iastate.edu>:\n> > Anyone know about the Weitek P9000 
graphics chip?\n > As far as the low-level stuff goes, it looks pretty nice. It\'s 
got this\n > quadrilateral fill command that requires just the four
points.\n\nDo you have Weitek\'s address/phone number? I\'d like to get some 
information\nabout this chip.\n\n--\nJoe Green\t\t\t\tHarris 
Corporation\njgreen@csd.harris.com\t\t\tComputer Systems Division\n"The only thing that 
really scares me is a person with no sense of humor."\n\t\t\t\t\t\t-- Jonathan 
Winters\n']

先決條件

我們需要來自 NLTK 的停用詞和來自 Scapy 的英語模型。兩者都可以按如下方式下載：

import nltk;
nltk.download('stopwords')
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

匯入必要的包

為了構建 LSI 模型，我們需要匯入以下必要的包：

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import matplotlib.pyplot as plt

準備停用詞

現在我們需要匯入停用詞並使用它們：

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

清理文字

現在，藉助 Gensim 的simple_preprocess()，我們需要將每個句子標記化為單詞列表。我們還應該刪除標點符號和不必要的字元。為此，我們將建立一個名為sent_to_words()的函式：

def sent_to_words(sentences):
   for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))

構建二元語法和三元語法模型

眾所周知，二元語法是文件中經常一起出現的兩個單詞，三元語法是文件中經常一起出現的三個單詞。藉助 Gensim 的 Phrases 模型，我們可以做到這一點：

bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

過濾停用詞

接下來，我們需要過濾掉停用詞。除此之外，我們還將建立函式來建立二元語法、三元語法和進行詞形還原：

def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out

構建主題模型的詞典和語料庫

我們現在需要構建詞典和語料庫。我們之前在示例中也這樣做過：

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]

構建 LSI 主題模型

我們已經實現了訓練 LSI 模型所需的一切。現在，是時候構建 LSI 主題模型了。對於我們的實現示例，可以使用以下程式碼行完成：

lsi_model = gensim.models.lsimodel.LsiModel(
   corpus=corpus, id2word=id2word, num_topics=20,chunksize=100
)

實現示例

讓我們看看構建 LDA 主題模型的完整實現示例：

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub('\s+', ' ', sent) for sent in data]
data = [re.sub("\'", "", sent) for sent in data]
print(data_words[:4]) #it will print the data after prepared for stopwords
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
return texts_out
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(
   data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
)
print(data_lemmatized[:4]) #it will print the lemmatized data.
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4]) #it will print the corpus we created above.
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:4]] 
#it will print the words with their frequencies.
lsi_model = gensim.models.lsimodel.LsiModel(
   corpus=corpus, id2word=id2word, num_topics=20,chunksize=100
)

現在我們可以使用上面建立的 LSI 模型來獲取主題。

檢視 LSI 模型中的主題

我們上面建立的 LSI 模型(lsi_model)可用於檢視文件中的主題。這可以使用以下指令碼完成：

pprint(lsi_model.print_topics())
doc_lsi = lsi_model[corpus]

輸出

[
   (0,
   '1.000*"ax" + 0.001*"_" + 0.000*"tm" + 0.000*"part" +    0.000*"pne" + '
   '0.000*"biz" + 0.000*"mbs" + 0.000*"end" + 0.000*"fax" + 0.000*"mb"'),
   (1,
   '0.239*"say" + 0.222*"file" + 0.189*"go" + 0.171*"know" + 0.169*"people" + '
   '0.147*"make" + 0.140*"use" + 0.135*"also" + 0.133*"see" + 0.123*"think"')
]

分層狄利克雷過程 (HPD)

LDA 和 LSI 等主題模型有助於總結和組織無法手動分析的大型文字檔案。除了 LDA 和 LSI 之外，Gensim 中另一個強大的主題模型是 HDP（分層狄利克雷過程）。它基本上是用於對分組資料進行無監督分析的混合成員模型。與 LDA（其有限對應物）不同，HDP 從資料中推斷主題數量。

使用 Gensim 實現

為了在 Gensim 中實現 HDP，我們需要訓練語料庫和詞典（如在上面實現 LDA 和 LSI 主題模型的示例中所做的那樣）我們可以從 gensim.models.HdpModel 匯入 HDP 主題模型。在這裡，我們還將在 20 個新聞組資料上實現 HDP 主題模型，步驟也相同。

對於我們的語料庫和詞典（在上面 LSI 和 LDA 模型的示例中建立），我們可以按如下方式匯入 HdpModel：

Hdp_model = gensim.models.hdpmodel.HdpModel(corpus=corpus, id2word=id2word)

檢視 LSI 模型中的主題

HDP 模型(Hdp_model)可用於檢視文件中的主題。這可以使用以下指令碼完成：

pprint(Hdp_model.print_topics())

輸出

[
   (0,
   '0.009*line + 0.009*write + 0.006*say + 0.006*article + 0.006*know + '
   '0.006*people + 0.005*make + 0.005*go + 0.005*think + 0.005*be'),
   (1,
   '0.016*line + 0.011*write + 0.008*article + 0.008*organization + 0.006*know '
   '+ 0.006*host + 0.006*be + 0.005*get + 0.005*use + 0.005*say'),
   (2,
   '0.810*ax + 0.001*_ + 0.000*tm + 0.000*part + 0.000*mb + 0.000*pne + '
   '0.000*biz + 0.000*end + 0.000*wwiz + 0.000*fax'),
   (3,
   '0.015*line + 0.008*write + 0.007*organization + 0.006*host + 0.006*know + '
   '0.006*article + 0.005*use + 0.005*thank + 0.004*get + 0.004*problem'),
   (4,
   '0.004*line + 0.003*write + 0.002*believe + 0.002*think + 0.002*article + '
   '0.002*belief + 0.002*say + 0.002*see + 0.002*look + 0.002*organization'),
   (5,
   '0.005*line + 0.003*write + 0.003*organization + 0.002*article + 0.002*time '
   '+ 0.002*host + 0.002*get + 0.002*look + 0.002*say + 0.001*number'),
   (6,
   '0.003*line + 0.002*say + 0.002*write + 0.002*go + 0.002*gun + 0.002*get + '
   '0.002*organization + 0.002*bill + 0.002*article + 0.002*state'),
   (7,
   '0.003*line + 0.002*write + 0.002*article + 0.002*organization + 0.001*none '
   '+ 0.001*know + 0.001*say + 0.001*people + 0.001*host + 0.001*new'),
   (8,
   '0.004*line + 0.002*write + 0.002*get + 0.002*team + 0.002*organization + '
   '0.002*go + 0.002*think + 0.002*know + 0.002*article + 0.001*well'),
   (9,
   '0.004*line + 0.002*organization + 0.002*write + 0.001*be + 0.001*host + '
   '0.001*article + 0.001*thank + 0.001*use + 0.001*work + 0.001*run'),
   (10,
   '0.002*line + 0.001*game + 0.001*write + 0.001*get + 0.001*know + '
   '0.001*thing + 0.001*think + 0.001*article + 0.001*help + 0.001*turn'),
   (11,
   '0.002*line + 0.001*write + 0.001*game + 0.001*organization + 0.001*say + '
   '0.001*host + 0.001*give + 0.001*run + 0.001*article + 0.001*get'),
   (12,
   '0.002*line + 0.001*write + 0.001*know + 0.001*time + 0.001*article + '
   '0.001*get + 0.001*think + 0.001*organization + 0.001*scope + 0.001*make'),
   (13,
   '0.002*line + 0.002*write + 0.001*article + 0.001*organization + 0.001*make '
   '+ 0.001*know + 0.001*see + 0.001*get + 0.001*host + 0.001*really'),
   (14,
   '0.002*write + 0.002*line + 0.002*know + 0.001*think + 0.001*say + '
   '0.001*article + 0.001*argument + 0.001*even + 0.001*card + 0.001*be'),
   (15,
   '0.001*article + 0.001*line + 0.001*make + 0.001*write + 0.001*know + '
   '0.001*say + 0.001*exist + 0.001*get + 0.001*purpose + 0.001*organization'),
   (16,
   '0.002*line + 0.001*write + 0.001*article + 0.001*insurance + 0.001*go + '
   '0.001*be + 0.001*host + 0.001*say + 0.001*organization + 0.001*part'),
   (17,
   '0.001*line + 0.001*get + 0.001*hit + 0.001*go + 0.001*write + 0.001*say + '
   '0.001*know + 0.001*drug + 0.001*see + 0.001*need'),
   (18,
   '0.002*option + 0.001*line + 0.001*flight + 0.001*power + 0.001*software + '
   '0.001*write + 0.001*add + 0.001*people + 0.001*organization + 0.001*module'),
   (19,
   '0.001*shuttle + 0.001*line + 0.001*roll + 0.001*attitude + 0.001*maneuver + '
   '0.001*mission + 0.001*also + 0.001*orbit + 0.001*produce + 0.001*frequency')
]

列印頁面