使用 Python 中的 NLTK 進行停用詞的詞性標註？

自然語言處理背後的主要思想是，機器可以在一定程度上無需人工干預地進行某種形式的分析或處理，例如理解文字的部分含義或試圖表達的內容。

在處理文字時，計算機需要從文字中過濾掉無用或不太重要的資料（單詞）。在 NLTK 中，無用詞（資料）被稱為停用詞。

安裝所需的庫

首先，您需要 nltk 庫，只需在您的終端中執行以下命令即可

$pip install nltk

因此，我們將刪除這些停用詞，以便它們不會佔用我們資料庫的空間或佔用寶貴的處理時間。

您可以建立您自己的單詞列表，這些單詞您可能認為是停用詞。預設情況下，NLTK 包含一些它們認為是停用詞的片語，您可以透過 NLTK 語料庫訪問它，方法是

>>> import nltk
>>> from nltk.corpus import stopwords

以下是 NLTK 停用詞的列表

>>> set(stopwords.words('english'))
{'not', 'other', 'shan', "hadn't", 'she', 'did', 'through', 'and', 'does', "that'll", "weren't", 'your', "should've", "hasn't", 'myself', 'should', 'because', 'wasn', 'what', 'to', 'this', 'was', 'more', 'y', 'again', "needn't", 'into', 'above', 'themselves', 'd', "won't", 'during', 'haven', 'both', "shan't", 'their', 'on', 'hadn', 'up', 'once', 'its', 'against', 'before', 't', 'while', 'needn', 'doing', "don't", 'yourselves', 'until', 'is', 'all', 's', 'will', "you've", 'being', 'under', 'they', 'ours', 'wouldn', 'of', 'didn', 'below', 'just', 'ma', 'yours', "you'll", 'mightn', 'where', 'are', 'that', 'those', 'most', 'them', 'if', 'you', "shouldn't", 'off', 'for', 'her', 'such', 'now', 'than', 're', 'no', 'm', 'or', "aren't", 'further', 'here', "wasn't", 'after', "haven't", 'my', 'himself', 'at', 'had', 'yourself', 'by', 'weren', 'only', 'have', 'we', 'do', 'same', "isn't", 'herself', 'll', 'down', 'then', 'why', 'own', 'him', 'so', 'having', 'nor', 'isn', 'few', 'how', 'each', 'there', 'with', 'couldn', 'about', 'very', 'am', 'me', "didn't", "doesn't", 'which', "she's", 'doesn', 'were', 'he', 'in', "mightn't", 'when', 'our', 'who', 'his', "couldn't", 'the', "you'd", 'be', 'hers', 'hasn', 'between', 'it', 'mustn', 'but', 'out', 'can', "wouldn't", 'ourselves', 'whom', 'been', 'these', 'aren', 'over', 'itself', 'a', 'i', 'too', 'theirs', 'some', "you're", 'as', 'won', "it's", 'from', 'o', 'don', 'any', 've', 'ain', 'has', 'an', "mustn't", 'shouldn'}

下面是一個完整的程式，它將演示如何使用停用詞從文字中刪除停用詞

示例程式碼

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "Python is a powerful high-level, object-oriented programming language created by Guido van Rossum."\
"It has simple easy-to-use syntax, making it the perfect language for someone trying to learn computer programming for the first time."\
"This is a comprehensive guide on how to get started in Python, why you should learn it and how you can learn it. However, if you knowledge "\
"of other programming languages and want to quickly get started with Python."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

filtered_sentence = []

for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

輸出

文字輸出：無過濾（帶停用詞）

['Python', 'is', 'a', 'powerful', 'high-level', ',', 'object-oriented', 'programming', 'language', 'created', 'by', 'Guido', 'van', 'Rossum.It', 'has', 'simple', 'easy-to-use', 'syntax', ',', 'making', 'it', 'the', 'perfect', 'language', 'for', 'someone', 'trying', 'to', 'learn', 'computer', 'programming', 'for', 'the', 'first', 'time.This', 'is', 'a', 'comprehensive', 'guide', 'on', 'how', 'to', 'get', 'started', 'in', 'Python', ',', 'why', 'you', 'should', 'learn', 'it', 'and', 'how', 'you', 'can', 'learn', 'it', '.', 'However', ',', 'if', 'you', 'knowledge', 'of', 'other', 'programming', 'languages', 'and', 'want', 'to', 'quickly', 'get', 'started', 'with', 'Python', '.']

文字輸出：有過濾（刪除停用詞）

['Python', 'powerful', 'high-level', ',', 'object-oriented', 'programming', 'language', 'created', 'Guido', 'van', 'Rossum.It', 'simple', 'easy-to-use', 'syntax', ',', 'making', 'perfect', 'language', 'someone', 'trying', 'learn', 'computer', 'programming', 'first', 'time.This', 'comprehensive', 'guide', 'get', 'started', 'Python', ',', 'learn', 'learn', '.', 'However', ',', 'knowledge', 'programming', 'languages', 'want', 'quickly', 'get', 'started', 'Python', '.']

Nitya Raut

更新於： 2019年7月30日

212 次檢視

開啟您的職業生涯

透過完成課程獲得認證

立即開始