Python - Text Classification

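We often need to sort text into categories according to predefined criteria. NLTK ships several corpora that already carry such category labels. In the example below, we load the movie reviews corpus and list the categories it defines.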

# List the categories defined in the movie reviews corpus.
# Assumes the corpus is installed, e.g. via nltk.download('movie_reviews').
from nltk.corpus import movie_reviews

all_cats = [category.lower() for category in movie_reviews.categories()]
print(all_cats)

Running the program above produces the following output:

['neg', 'pos']
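
File identifiers in this corpus take the form `pos/cv944_13521.txt`, with the category as the leading directory. As a minimal sketch of how files map to categories, the snippet below groups identifiers by that prefix. The helper name `group_by_category` and the two `neg/example_*.txt` identifiers are hypothetical, made up for illustration; with the corpus installed, `movie_reviews.fileids()` could be passed in instead.

```python
from collections import defaultdict

def group_by_category(fileids):
    """Group file identifiers by the directory prefix before the first '/'."""
    groups = defaultdict(list)
    for fid in fileids:
        category, _, name = fid.partition("/")
        groups[category].append(name)
    return dict(groups)

# Only the first identifier comes from the corpus; the others are placeholders.
sample_ids = ["pos/cv944_13521.txt", "neg/example_a.txt", "neg/example_b.txt"]

for category, names in sorted(group_by_category(sample_ids).items()):
    print(category, len(names))
```

The corpus reader can also do this filtering itself: `movie_reviews.fileids(categories="pos")` returns only the identifiers in the given category.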

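Now let's look at the contents of one of the files containing a positive review. We split the raw text of the file into sentences and print the first four as a sample.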

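from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize

# Read the raw text of one positive review.
sample = movie_reviews.raw("pos/cv944_13521.txt")

# Split the text into sentences (requires NLTK's Punkt tokenizer data)
# and print the first four.
sentences = sent_tokenize(sample)
for sentence in sentences[:4]:
    print(sentence)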

Running the program above produces the following output:

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade with hollywood churning out films 
like deep impact , = godzilla , the x-files , armageddon , the truman show , 
all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film 
releases from the = spielberg-katzenberg-geffen's dreamworks production company .

Next, we collect every word token from these files, lowercase it, and use the FreqDist class in nltk to find the most common words.

import nltk
from nltk.corpus import movie_reviews

# Lowercase every token so that "The" and "the" are counted together.
all_words = [w.lower() for w in movie_reviews.words()]

# FreqDist counts how often each token occurs.
freq_dist = nltk.FreqDist(all_words)
print(freq_dist.most_common(10))

Running the program above produces the following output:

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), 
('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
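
The list above is dominated by punctuation and function words, which say little about whether a review is positive or negative. The following is a minimal sketch of one way to filter them out before counting. It uses `collections.Counter` (the class that `FreqDist` extends) so that it runs without the corpus. The function name `top_content_words`, the tiny stopword set, and the sample tokens are all assumptions made for illustration, not part of NLTK.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword set for illustration only.
SAMPLE_STOPWORDS = {"the", "a", "and", "of", "to", "is", "in"}

def top_content_words(words, stopwords=SAMPLE_STOPWORDS, n=10):
    """Return the n most common alphabetic, non-stopword tokens."""
    counts = Counter(
        w.lower()
        for w in words
        if w.isalpha() and w.lower() not in stopwords
    )
    return counts.most_common(n)

# Made-up tokens standing in for movie_reviews.words().
sample_tokens = ["the", "film", "is", "a", "summer", "film", ",", "and",
                 "the", "summer", "film", "season", "is", "here", "!"]

print(top_content_words(sample_tokens, n=2))  # [('film', 3), ('summer', 2)]
```

Because the function accepts any iterable of strings, `movie_reviews.words()` could be passed in place of `sample_tokens`, and a fuller stopword list would likely be needed for real use.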
