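5 Simple Ways to Tokenize Text in Python

Tokenization is the process of splitting a string into tokens, or "smaller pieces". In the context of natural language processing (NLP), tokens are usually words, punctuation marks, and numbers. Tokenization is an important preprocessing step for many NLP tasks because it lets you work with individual words and symbols instead of raw text.

In this article, we will look at five ways to perform tokenization in Python. We start with the simplest approach, the split() method, and then move on to more advanced techniques using libraries and modules such as nltk, re, string, and shlex.

Using the split() Method

split() is a built-in method of Python's str class that splits a string into a list of substrings around a given delimiter. Here is an example of how to use it: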

text = "This is a sample text"
tokens = text.split(" ")
print(tokens)

This code splits the string text on the space character, producing the following tokens:

['This', 'is', 'a', 'sample', 'text']
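The choice of separator matters once the input contains irregular spacing. The following is a minimal sketch with a made-up sample string; it compares an explicit " " separator with calling split() without arguments, and shows the standard maxsplit parameter:

```python
# Sample string with repeated spaces (an illustrative input, not from the article)
text = "This  is a sample   text"

# Explicit " " separator: every single space is a boundary,
# so consecutive spaces produce empty strings in the result
explicit = text.split(" ")

# No argument: split on runs of any whitespace and drop empty strings
default = text.split()

# maxsplit limits the number of splits; the remainder is returned unsplit
head_and_rest = text.split(maxsplit=1)

print(explicit)       # ['This', '', 'is', 'a', 'sample', '', '', 'text']
print(default)        # ['This', 'is', 'a', 'sample', 'text']
print(head_and_rest)  # ['This', 'is a sample   text']
```

For free-form text, calling split() with no arguments is usually the safer default, since it never yields empty tokens.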

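Note that split() accepts only a single separator string; passing a list of delimiters raises a TypeError. To split on several delimiters with split() alone, you can first replace each extra delimiter with a space and then split on whitespace. For example:

text = "This is a sample, text with punctuation!"
# str.split() takes a single separator, so turn the other delimiters into spaces first
for delimiter in (",", "!"):
    text = text.replace(delimiter, " ")
tokens = text.split()  # no argument: split on runs of whitespace
print(tokens)

This treats spaces, commas, and exclamation marks as delimiters, and the resulting tokens are ['This', 'is', 'a', 'sample', 'text', 'with', 'punctuation']. Note that the delimiters themselves are discarded. Splitting with text.split(" ") instead would leave empty strings in the list wherever two delimiters were adjacent.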

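One limitation of split() is that it can only split on a fixed delimiter string. If you want to split based on more complex patterns (for example, extracting words or numbers), you need more advanced techniques.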

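Using the nltk Library

The Natural Language Toolkit (nltk) is a popular Python library for working with human language data. It provides several tokenization functions that split a string into tokens according to different criteria.

To use the nltk library, you first need to install it. You can do so by running the following command: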

pip install nltk

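After installing nltk, you can use the word_tokenize() function to split a string into tokens at word boundaries. It relies on the Punkt tokenizer models, which must be downloaded once:

import nltk
# One-time download of the Punkt models (newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")
text = "This is a sample text"
tokens = nltk.word_tokenize(text)
print(tokens)

For this sentence, the result is the same as with split() above. Unlike split(), however, word_tokenize() separates punctuation into tokens of its own.

The nltk library also provides other tokenization functions, such as sent_tokenize(), which splits text into sentences.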

Example

Let's look at an example:

from nltk.tokenize import sent_tokenize

# Define the text to be tokenized
text = "This is an example sentence for tokenization. And this is another sentence"

# Tokenize the text into sentences
sentences = sent_tokenize(text)

print(sentences)

Output

This prints a list of sentences:

['This is an example sentence for tokenization.', 'And this is another sentence']

Example

We can also tokenize text into words by importing word_tokenize() directly from the nltk.tokenize module, as shown below:

from nltk.tokenize import word_tokenize
# Define the text to be tokenized
text = "This is an example sentence for tokenization."
# Tokenize the text into words
words = word_tokenize(text)
print(words)

Output

This prints a list of words:

['This', 'is', 'an', 'example', 'sentence', 'for', 'tokenization', '.']

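As you can see, word_tokenize() splits the text into individual words and separates the final period into its own token. It is the same function as nltk.word_tokenize() used earlier, just imported directly from nltk.tokenize.

Example

NLTK also provides a class called TweetTokenizer, designed specifically for tokenizing tweets (short text messages on the social media platform Twitter). It works much like word_tokenize(), but it takes tweet-specific features into account, such as hashtags, mentions, and emoticons.

Here is an example of how to use TweetTokenizer: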

# TweetTokenizer is rule-based, so no nltk.download() call is needed
from nltk.tokenize import TweetTokenizer

# Define the text to be tokenized
tweet = "This is an example tweet with #hashtag and @mention. 😊"

# Create a TweetTokenizer object
tokenizer = TweetTokenizer()

# Tokenize the text
tokens = tokenizer.tokenize(tweet)
print(tokens)

Output

It produces the following output:

['This', 'is', 'an', 'example', 'tweet', 'with', '#hashtag', 'and', '@mention', '.', '😊']

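As you can see, TweetTokenizer not only splits the text into individual words but also keeps hashtags and mentions intact as single tokens. It also handles the emoji, emoticons, and special characters commonly used in tweets.

This is very useful if you are working with Twitter data and want to analyze specific aspects of tweets, such as hashtags and mentions.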

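Using Regular Expressions

Regular expressions are a powerful tool for matching and manipulating strings, and they can be used for a wide range of tokenization tasks.

Example

Let's look at an example of tokenization in Python using regular expressions: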

import re

text = "This is a sample text"

# Split on one or more whitespace characters
pattern = r"\s+"
tokens = re.split(pattern, text)
print(tokens)

# The pattern matches runs of non-whitespace (the words), so splitting on it
# discards the words and keeps only what lies between them
pattern = r"\S+"
tokens = re.split(pattern, text)
print(tokens)

# Split on numbers (any run of digits); this text has none, so nothing is split
pattern = r"\d+"
tokens = re.split(pattern, text)
print(tokens)

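This code has three parts:

The first part uses a pattern that matches one or more whitespace characters, so the resulting tokens are the words in the string.
The second part uses a pattern that matches any run of non-whitespace characters. Because re.split() discards whatever the pattern matches, the words themselves are removed, leaving only the spaces between them plus an empty string at each end.
The third part uses a pattern that matches any run of digits. The sample text contains no digits, so nothing is split and the result is a list containing the whole string.

Output

When you run this code, it produces the following output: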

['This', 'is', 'a', 'sample', 'text']
['', ' ', ' ', ' ', ' ', '']
['This is a sample text']
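When the goal is to extract words, numbers, and punctuation rather than to split around them, re.findall() is often a better fit than re.split(). The following is a minimal sketch; the sample sentence and the pattern are illustrative assumptions, not a general-purpose tokenizer:

```python
import re

# Illustrative input containing words, numbers, and punctuation
text = "It costs 19.99 dollars, or 20 with tax!"

# Alternatives are tried left to right at each position:
#   \d+(?:\.\d+)?  an integer or a decimal number
#   \w+            a run of word characters
#   [^\w\s]        any single character that is neither word nor whitespace
token_pattern = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

# findall() returns the matched text itself, so nothing is discarded
tokens = token_pattern.findall(text)
print(tokens)
# ['It', 'costs', '19.99', 'dollars', ',', 'or', '20', 'with', 'tax', '!']

# Extract only the numbers
numbers = re.findall(r"\d+(?:\.\d+)?", text)
print(numbers)  # ['19.99', '20']
```

Because findall() returns what the pattern matches, the pattern describes the tokens directly, which is usually easier to reason about than describing the gaps between them.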

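Using the string Module

The string module in Python provides a number of string-handling utilities, including the Template class. Template performs placeholder substitution rather than tokenization itself, but it can be combined with split() to tokenize strings built from a template.

To use the Template class, import the string module and define a template string containing placeholders for the values to fill in. For example: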

import string
text = "This is a $token text"
template = string.Template(text)

You can then use the substitute() method to replace the placeholder with an actual value and split the resulting string on the space character:

tokens = template.substitute({"token": "sample"}).split(" ")
print(tokens)

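This replaces the placeholder $token with the word "sample" and splits the resulting string on spaces, producing the tokens ['This', 'is', 'a', 'sample', 'text'].

The Template class is useful for tokenizing strings with variable values, such as templated emails or messages.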
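For templated messages, it can also help to list the placeholders a template contains before filling them in. The following is a minimal sketch that uses the documented Template.pattern regular expression and its "named" and "braced" groups; the helper name placeholder_names and the sample message are hypothetical, introduced only for illustration:

```python
from string import Template

def placeholder_names(template_text):
    """Return placeholder names in order of appearance (hypothetical helper)."""
    names = []
    for match in Template.pattern.finditer(template_text):
        # "$name" is captured by the "named" group, "${name}" by "braced";
        # an escaped "$$" matches neither and is skipped
        name = match.group("named") or match.group("braced")
        if name is not None:
            names.append(name)
    return names

message = "Hello $name, order ${order_id} costs $$5"
print(placeholder_names(message))  # ['name', 'order_id']

# Fill in the placeholders, then tokenize the result on whitespace
filled = Template(message).substitute(name="Ada", order_id="A17")
print(filled.split())  # ['Hello', 'Ada,', 'order', 'A17', 'costs', '$5']
```

Note that splitting on whitespace leaves punctuation attached to the tokens ('Ada,'), so this approach suits simple, controlled message formats.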

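Using the shlex Module

The shlex module provides a lexical analyzer for shell-style syntax. It splits a string into tokens the way a shell would.

To use the shlex module, import it and call shlex.split():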

import shlex
text = "This is a sample text"
tokens = shlex.split(text)
print(tokens)

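For this text, the string is split on whitespace, just as with split() and nltk. Unlike split(), shlex also honors shell quoting and escaping rules, which makes it useful for tokenizing strings with shell-style syntax, such as command-line arguments.

Output

When you run this code, it produces the following output: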

['This', 'is', 'a', 'sample', 'text']
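The difference from str.split() only shows up once the input contains quotes. The following is a minimal sketch with a made-up command line:

```python
import shlex

# Illustrative command line with a quoted argument
command = 'grep -i "hello world" notes.txt'

# str.split() knows nothing about quotes and breaks the quoted phrase apart
naive_tokens = command.split()
print(naive_tokens)  # ['grep', '-i', '"hello', 'world"', 'notes.txt']

# shlex.split() keeps the quoted phrase as one token and strips the quotes
shell_tokens = shlex.split(command)
print(shell_tokens)  # ['grep', '-i', 'hello world', 'notes.txt']
```

Keep in mind that shlex.split() raises a ValueError on unbalanced quotes, so arbitrary natural-language text containing apostrophes may need a different tokenizer.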

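Conclusion

Tokenization is the process of splitting a string into smaller pieces, or tokens. In the context of natural language processing, tokens are usually words, punctuation marks, and numbers. Tokenization is an important preprocessing step for many NLP tasks because it lets you work with individual words and symbols instead of raw text.

In this tutorial, we covered five ways to perform tokenization in Python: the split() method, the nltk library, regular expressions, the string module, and the shlex module. Each approach has its own strengths and limitations, so it is important to choose the one that best fits your needs. Whether you are working with simple strings or complex human language data, Python offers a range of tools and libraries for tokenizing text effectively.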

Gaurav Leekha

Updated on: August 21, 2023
