- Gensim 教程
- Gensim - 首頁
- Gensim - 簡介
- Gensim - 入門
- Gensim - 文件和語料庫
- Gensim - 向量和模型
- Gensim - 建立詞典
- 建立詞袋 (BoW) 語料庫
- Gensim - 變換
- Gensim - 建立TF-IDF矩陣
- Gensim - 主題建模
- Gensim - 建立LDA主題模型
- Gensim - 使用LDA主題模型
- Gensim - 建立LDA Mallet模型
- Gensim - 文件和LDA模型
- Gensim - 建立LSI和HDP主題模型
- Gensim - 開發詞嵌入
- Gensim - Doc2Vec模型
- Gensim 有用資源
- Gensim - 快速指南
- Gensim - 有用資源
- Gensim - 討論
Gensim - 建立LDA主題模型
本章將幫助您學習如何在 Gensim 中建立潛在狄利克雷分配 (LDA) 主題模型。
從大量文字中自動提取有關主題的資訊是自然語言處理 (NLP) 的主要應用之一。大量文字可能是來自酒店評論、推文、Facebook帖子、任何其他社交媒體渠道的帖子、電影評論、新聞報道、使用者反饋、電子郵件等的饋送。
在這個數字時代,瞭解人們/客戶在談論什麼,理解他們的觀點和問題,對於企業、政治活動和管理人員來說可能非常有價值。但是,是否可以手動閱讀如此大量的文字,然後從中提取主題資訊呢?
不,這不可能。它需要一個能夠自動讀取這些大量文字文件並自動從中提取所需資訊/討論主題的演算法。
LDA 的作用
LDA 對主題建模的方法是將文件中的文字分類到特定主題。LDA 建立了作為狄利克雷分佈建模的模型:
- 每個文件的主題模型,以及
- 每個主題的詞語模型
在提供 LDA 主題模型演算法後,為了獲得良好的主題-關鍵詞分佈組合,它會重新排列:
- 文件中的主題分佈,以及
- 主題中的關鍵詞分佈
在處理過程中,LDA 做出的一些假設包括:
- 每個文件都被建模為主題的多項式分佈。
- 每個主題都被建模為詞語的多項式分佈。
- 我們必須選擇正確的語料庫資料,因為 LDA 假設每個文字塊都包含相關的詞語。
- LDA 還假設文件是由主題混合產生的。
使用 Gensim 實現
在這裡,我們將使用 LDA(潛在狄利克雷分配)從資料集中提取自然討論的主題。
載入資料集
我們將要使用的資料集是“20 個新聞組”的資料集,其中包含來自新聞報道各個部分的數千篇新聞文章。它在 Sklearn 資料集中可用。我們可以輕鬆地使用以下 Python 指令碼下載:
from sklearn.datasets import fetch_20newsgroups newsgroups_train = fetch_20newsgroups(subset='train')
讓我們使用以下指令碼檢視一些示例新聞:
newsgroups_train.data[:4]
["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks, \n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n", "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.\n\nGuy Kuo <;guykuo@u.washington.edu>\n", 'From: twillis@ec.ecn.purdue.edu (Thomas E Willis)\nSubject: PB questions...\nOrganization: Purdue University Engineering Computer Network\nDistribution: usa\nLines: 36\n\nwell folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985. sooo, i\'m in the market for a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected? i\'d heard the 185c was supposed to make an\nappearence "this summer" but haven\'t heard anymore on it - and since i\ndon\'t have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo\'s just went through recently?\n\n* what\'s the impression of the display on the 180? i could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don\'t really have\na feel for how much "better" the display is (yea, it looks great in the\nstore, but is that all "wow" or is it really that good?). could i solicit\nsome opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk size and money hit to get the active display? (i realize\nthis is a real subjective question, but i\'ve only played around with the\nmachines in a computer store breifly and figured the opinions of somebody\nwho actually uses the machine daily might prove helpful).\n\n* how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info - if you could email, i\'ll post a\nsummary (news reading time is at a premium with finals just around the\ncorner... : ( )\n--\nTom Willis \\ twillis@ecn.purdue.edu \\ Purdue Electrical Engineering\n---------------------------------------------------------------------------\ n"Convictions are more dangerous enemies of truth than lies." - F. W.\nNietzsche\n', 'From: jgreen@amber (Joe Green)\nSubject: Re: Weitek P9000 ?\nOrganization: Harris Computer Systems Division\nLines: 14\nDistribution: world\nNNTP-Posting-Host: amber.ssd.csd.harris.com\nX-Newsreader: TIN [version 1.1 PL9]\n\nRobert J.C. Kyanko (rob@rjck.UUCP) wrote:\n >abraxis@iastate.edu writes in article <abraxis.734340159@class1.iastate.edu >:\n> > Anyone know about the Weitek P9000 graphics chip?\n > As far as the low-level stuff goes, it looks pretty nice. It\'s got this\n> quadrilateral fill command that requires just the four points.\n\nDo you have Weitek\'s address/phone number? I\'d like to get some information\nabout this chip.\n\n--\nJoe Green\t\t\t\tHarris Corporation\njgreen@csd.harris.com\t\t\tComputer Systems Division\n"The only thing that really scares me is a person with no sense of humor. "\n\t\t\t\t\t\t-- Jonathan Winters\n']
先決條件
我們需要來自 NLTK 的停用詞和來自 Scapy 的英語模型。兩者都可以按如下方式下載:
import nltk;
nltk.download('stopwords')
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
匯入必要的包
為了構建 LDA 模型,我們需要匯入以下必要的包:
import re import numpy as np import pandas as pd from pprint import pprint import gensim import gensim.corpora as corpora from gensim.utils import simple_preprocess from gensim.models import CoherenceModel import spacy import pyLDAvis import pyLDAvis.gensim import matplotlib.pyplot as plt
準備停用詞
現在,我們需要匯入停用詞並使用它們:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
清理文字
現在,藉助 Gensim 的 simple_preprocess(),我們需要將每個句子標記化為詞語列表。我們還應該刪除標點符號和不必要的字元。為此,我們將建立一個名為 sent_to_words() 的函式:
def sent_to_words(sentences):
for sentence in sentences:
yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))
構建二元語法和三元語法模型
眾所周知,二元語法是文件中經常一起出現的兩個詞語,三元語法是文件中經常一起出現的三個詞語。藉助 Gensim 的 Phrases 模型,我們可以做到這一點:
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) trigram = gensim.models.Phrases(bigram[data_words], threshold=100) bigram_mod = gensim.models.phrases.Phraser(bigram) trigram_mod = gensim.models.phrases.Phraser(trigram)
過濾掉停用詞
接下來,我們需要過濾掉停用詞。除此之外,我們還將建立函式來建立二元語法、三元語法和進行詞形還原:
def remove_stopwords(texts):
return [[word for word in simple_preprocess(str(doc))
if word not in stop_words] for doc in texts]
def make_bigrams(texts):
return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
texts_out = []
for sent in texts:
doc = nlp(" ".join(sent))
texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
return texts_out
為主題模型構建詞典和語料庫
現在我們需要構建詞典和語料庫。我們之前在示例中也做過:
id2word = corpora.Dictionary(data_lemmatized) texts = data_lemmatized corpus = [id2word.doc2bow(text) for text in texts]
構建 LDA 主題模型
我們已經實現了訓練 LDA 模型所需的一切。現在是構建 LDA 主題模型的時候了。對於我們的實現示例,可以使用以下幾行程式碼完成:
lda_model = gensim.models.ldamodel.LdaModel( corpus=corpus, id2word=id2word, num_topics=20, random_state=100, update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True )
實現示例
讓我們看看構建 LDA 主題模型的完整實現示例:
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub('\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub('\s+', ' ', sent) for sent in data]
data = [re.sub("\'", "", sent) for sent in data]
print(data_words[:4]) #it will print the data after prepared for stopwords
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def remove_stopwords(texts):
return [[word for word in simple_preprocess(str(doc))
if word not in stop_words] for doc in texts]
def make_bigrams(texts):
return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
[trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
texts_out = []
for sent in texts:
doc = nlp(" ".join(sent))
texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
return texts_out
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=[
'NOUN', 'ADJ', 'VERB', 'ADV'
])
print(data_lemmatized[:4]) #it will print the lemmatized data.
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4]) #it will print the corpus we created above.
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:4]]
#it will print the words with their frequencies.
lda_model = gensim.models.ldamodel.LdaModel(
corpus=corpus, id2word=id2word, num_topics=20, random_state=100,
update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)
我們現在可以使用上面建立的 LDA 模型來獲取主題,計算模型困惑度。