Gensim - Creating a Bag of Words (BoW) Corpus

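We have seen how to create a dictionary from a list of documents and from one or more text files. In this section, we will create a bag-of-words (BoW) corpus. It is one of the most important objects to be familiar with when working with Gensim. In essence, it is a corpus that holds, for each document, the IDs of the words it contains and how often each one occurs.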

Creating a BoW Corpus

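As noted above, a Gensim corpus holds the word IDs and their frequencies for each document. We can create a BoW corpus from a simple list of documents or from text files. All we need to do is pass a list of tokenized words to the doc2bow() method of a Dictionary object. Let us begin by creating a BoW corpus from a simple list of documents.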

From a Simple List of Sentences

In the following example, we create a BoW corpus from a simple list containing three sentences.

First, import the necessary packages:

import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess

Now provide the list of sentences. Our list contains three sentences:

doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]

Next, tokenize the sentences:

doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
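
To make the tokenization step concrete, here is a rough standard-library approximation of what simple_preprocess does with its default settings: it lowercases the text, keeps runs of letters, and drops tokens shorter than 2 or longer than 15 characters. The helper name rough_preprocess and its regular expression are assumptions made for illustration; Gensim's own tokenizer may differ in edge cases such as underscores and accents.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Hypothetical stand-in for simple_preprocess, using only the standard library."""
    # Lowercase the text and keep runs of letters (no digits, no punctuation).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop tokens that are too short or too long.
    return [t for t in tokens if min_len <= len(t) <= max_len]

doc_list = [
    "Hello, how are you?", "How do you do?",
    "Hey what are you doing? yes you What are you doing?",
]
tokenized = [rough_preprocess(doc) for doc in doc_list]
for tokens in tokenized:
    print(tokens)
```

Each sentence becomes a list of lowercase words with the punctuation removed, which is the shape that doc2bow() expects.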

Create an empty corpora.Dictionary() object:

dictionary = corpora.Dictionary()

Now pass the tokenized sentences to the dictionary.doc2bow() method. Setting allow_update=True adds any tokens the dictionary has not seen yet:

BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]

Finally, we can print the bag-of-words corpus:

print(BoW_corpus)

Output

[
   [(0, 1), (1, 1), (2, 1), (3, 1)], 
   [(2, 1), (3, 1), (4, 2)], [(0, 2), (3, 3), (5, 2), (6, 1), (7, 2), (8, 1)]
]

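Each pair in the output has the form (word ID, count). For example, the pair (0, 1) in the first list shows that the word with ID 0 appears once in the first document, and so on.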
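
The following minimal sketch shows how such (word ID, count) pairs can arise, using only the standard library. It is not Gensim's implementation: the function doc2bow_sketch and the plain dict token2id are hypothetical names, and the rule that unseen tokens receive the next free IDs in alphabetical order within each document is an assumption inferred from the output above.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, adding unseen tokens to token2id."""
    counts = Counter(tokens)
    # Assumption: unseen tokens get the next free IDs, alphabetically per document.
    for token in sorted(t for t in counts if t not in token2id):
        token2id[token] = len(token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

token2id = {}  # plays the role of the dictionary being updated
docs = [
    ["hello", "how", "are", "you"],
    ["how", "do", "you", "do"],
    ["hey", "what", "are", "you", "doing", "yes", "you", "what", "are", "you", "doing"],
]
bow = [doc2bow_sketch(doc, token2id) for doc in docs]
print(bow)
print(token2id)
```

Under these assumptions, the sketch yields the same pairs as the output shown above, and token2id records which word each ID stands for.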

This output is hard for a person to read. We can convert the IDs back to words by looking them up in our dictionary:

id_words = [[(dictionary[token_id], count) for token_id, count in line] for line in BoW_corpus]
print(id_words)

Output

[
   [('are', 1), ('hello', 1), ('how', 1), ('you', 1)], 
   [('how', 1), ('you', 1), ('do', 2)], 
   [('are', 2), ('you', 3), ('doing', 2), ('hey', 1), ('what', 2), ('yes', 1)]
]

This output is much easier to read.

Complete Implementation Example

import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess
doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)
id_words = [[(dictionary[token_id], count) for token_id, count in line] for line in BoW_corpus]
print(id_words)

From a Text File

In the following example, we create a BoW corpus from a text file. To do so, we save the document to be processed in a text file named doc.txt.

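The file is read line by line, and each line is processed with simple_preprocess as it is read. This way, the raw file never has to be loaded into memory all at once. Note that the list comprehension used in the code below still keeps every tokenized line in memory.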
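
As a minimal sketch of line-by-line processing, the generator below yields one token list per line, so only the current line is held in memory at any time. The function name stream_tokens, the simple regular expression used in place of simple_preprocess, and the temporary sample file are all assumptions made to keep the example self-contained.

```python
import os
import re
import tempfile

def stream_tokens(path):
    """Yield one token list per line; only the current line is held in memory."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Simplified tokenizer standing in for simple_preprocess.
            yield re.findall(r"[a-z]+", line.lower())

# Write a small throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Hello, how are you?\nHow do you do?\n")
    sample_path = tmp.name

# list() is used here only to show the result; it pulls every line into memory.
streamed = list(stream_tokens(sample_path))
os.remove(sample_path)
print(streamed)
```

For a large file, you would iterate over the generator and pass each token list to dictionary.doc2bow() in turn instead of calling list() on it.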

Implementation Example

First, import the required packages:

import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess

Next, the following lines read the documents from doc.txt and tokenize them:

doc_tokenized = [
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
]
dictionary = corpora.Dictionary()

Now pass the tokenized words to the dictionary.doc2bow() method, as in the previous example:

BoW_corpus = [
   dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized
]
print(BoW_corpus)

Output

[
   [(9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], 
   [
      (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), 
      (22, 1), (23, 1), (24, 1)
   ], 
   [
      (23, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), 
      (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)
   ], 
   [(3, 1), (18, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)], 
   [
      (18, 1), (27, 1), (31, 2), (32, 1), (38, 1), (41, 1), (43, 1), 
      (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1)
   ]
]

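Note that the IDs in the output above start at 9 rather than 0. With a freshly created dictionary the IDs start at 0, so this output was most likely produced with the dictionary from the previous example still in use. The output also lists five documents, which suggests the file used to produce it had more lines than the text shown here.

The doc.txt file contains the following text: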

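CNTK, formerly known as the Computational Network Toolkit, is a free, easy-to-use, open-source, commercial-grade toolkit that enables us to train deep learning algorithms to learn like the human brain.

You can find its free tutorial on tutorialspoint.com, which also provides the best technical tutorials on technologies such as AI, deep learning, and machine learning for free.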

Complete Implementation Example

import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
doc_tokenized = [
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
]
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)

Saving and Loading a Gensim Corpus

We can save the corpus with the following script:

corpora.MmCorpus.serialize('/Users/Desktop/BoW_corpus.mm', BoW_corpus)

# Provide the path and file name for the corpus. Here the corpus is named BoW_corpus and is saved in Matrix Market format.
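
To give a feel for what Matrix Market format looks like, the sketch below renders a small BoW corpus as Matrix Market coordinate text: a header line, a line with the number of documents, terms, and non-zero entries, then one "document term count" line per entry with 1-based indices. The function to_matrix_market is a hypothetical helper written for illustration, and the exact header and number formatting that MmCorpus.serialize writes may differ.

```python
def to_matrix_market(bow_corpus):
    """Render a BoW corpus as Matrix Market coordinate text (1-based indices)."""
    num_docs = len(bow_corpus)
    # Highest term ID plus one gives the number of columns.
    num_terms = 1 + max((tid for doc in bow_corpus for tid, _ in doc), default=-1)
    num_nnz = sum(len(doc) for doc in bow_corpus)
    lines = [
        "%%MatrixMarket matrix coordinate real general",
        f"{num_docs} {num_terms} {num_nnz}",
    ]
    for doc_no, doc in enumerate(bow_corpus, start=1):
        for term_id, count in doc:
            # Matrix Market indices start at 1, while BoW term IDs start at 0.
            lines.append(f"{doc_no} {term_id + 1} {count}")
    return "\n".join(lines) + "\n"

sample_corpus = [[(0, 1), (1, 1), (2, 1), (3, 1)], [(2, 1), (3, 1), (4, 2)]]
mm_text = to_matrix_market(sample_corpus)
print(mm_text)
```

Because only the non-zero counts are stored, the format stays compact even when the vocabulary is large and each document uses a small part of it.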

Similarly, we can load the saved corpus with the following script:

corpus_load = corpora.MmCorpus('/Users/Desktop/BoW_corpus.mm')
for line in corpus_load:
   print(line)
