Natural Language Toolkit - Word Replacement

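Stemming and lemmatization can be thought of as a kind of linguistic compression. In the same sense, word replacement can be thought of as text normalization or error correction.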

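But why do we need word replacement? Consider tokenization: it has trouble with contractions such as can't and won't. Word replacement handles such cases. For example, we can replace contractions with their expanded forms.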

Word replacement using regular expressions

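First, we will replace words that match a regular expression. This requires a basic understanding of regular expressions and of the Python re module. In the example below, we use regular expressions to replace contractions with their expanded forms (for example, "can't" is replaced with "cannot").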

Example

First, import the re package to work with regular expressions.

import re

Next, define the replacement patterns of your choice, as follows:

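# Each entry is a (regular expression, replacement) pair.
# Specific contractions come first so the general patterns below do not catch them.
R_patterns = [
   (r'won\'t', 'will not'),
   (r'can\'t', 'cannot'),
   (r'i\'m', 'i am'),
   (r'(\w+)\'ll', r'\g<1> will'),
   (r'(\w+)n\'t', r'\g<1> not'),
   (r'(\w+)\'ve', r'\g<1> have'),
   (r'(\w+)\'s', r'\g<1> is'),
   (r'(\w+)\'re', r'\g<1> are'),
]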
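
The general patterns above rely on a capture group and a backreference: (\w+) captures the word before the apostrophe, and \g<1> puts it back in the replacement. A minimal sketch of that mechanism on its own, using a made-up sample sentence:

```python
import re

# (\w+) captures the word before the apostrophe; \g<1> re-inserts it.
pattern = re.compile(r"(\w+)'ll")
expanded = pattern.sub(r"\g<1> will", "they'll see that she'll come")
print(expanded)  # they will see that she will come
```

Writing the replacement as a raw string (r"...") keeps Python from treating \g as a string escape.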

Now, create a class that can be used to replace words:

class REReplacer(object):
   def __init__(self, patterns = R_patterns):
      # Compile each regular expression once, up front.
      self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
   def replace(self, text):
      s = text
      # Apply every substitution in list order.
      for (pattern, repl) in self.patterns:
         s = re.sub(pattern, repl, s)
      return s

Save this Python program (say, repRE.py) and run it from the Python command prompt. After that, import the REReplacer class whenever you want to replace words. Let us see how.

from repRE import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")
Output:
'I will not do it'
rep_word.replace("I can't do it")
Output:
'I cannot do it'
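
The order of the patterns matters, because each substitution is applied to the result of the previous one. The sketch below illustrates this with a hypothetical helper, expand, and two shortened pattern lists; neither is part of the class above.

```python
import re

def expand(text, patterns):
    """Apply each (regex, replacement) pair to text in list order."""
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text

# Same two patterns, in opposite order.
specific_first = [(r"won't", "will not"), (r"(\w+)n't", r"\g<1> not")]
generic_first = [(r"(\w+)n't", r"\g<1> not"), (r"won't", "will not")]

print(expand("I won't go", specific_first))  # I will not go
print(expand("I won't go", generic_first))   # I wo not go
```

With the general pattern first, "won't" is split into "wo" and "n't" before the specific pattern gets a chance to match, which is why irregular contractions are listed at the top.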

Complete implementation example

import re
# Each entry is a (regular expression, replacement) pair; specific contractions come first.
R_patterns = [
   (r'won\'t', 'will not'),
   (r'can\'t', 'cannot'),
   (r'i\'m', 'i am'),
   (r'(\w+)\'ll', r'\g<1> will'),
   (r'(\w+)n\'t', r'\g<1> not'),
   (r'(\w+)\'ve', r'\g<1> have'),
   (r'(\w+)\'s', r'\g<1> is'),
   (r'(\w+)\'re', r'\g<1> are'),
]
class REReplacer(object):
   def __init__(self, patterns=R_patterns):
      # Compile each regular expression once, up front.
      self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
   def replace(self, text):
      s = text
      # Apply every substitution in list order.
      for (pattern, repl) in self.patterns:
         s = re.sub(pattern, repl, s)
      return s

Now, once the above program is saved and run, you can import the class and use it as follows:

from repRE import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")

Output

'I will not do it'

Replacement before text processing

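When working on natural language processing (NLP), a common practice is to clean up the text before processing it. To that end, we can use the REReplacer class created in the previous example as a preliminary step before text processing, that is, before tokenization.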

Example

from nltk.tokenize import word_tokenize
from repRE import REReplacer
rep_word = REReplacer()
word_tokenize("I won't be able to do this now")
Output:
['I', 'wo', "n't", 'be', 'able', 'to', 'do', 'this', 'now']
word_tokenize(rep_word.replace("I won't be able to do this now"))
Output:
['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']

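In the Python example above, the difference between the tokenizer output with and without regular-expression replacement is easy to see: without replacement, "won't" is split into 'wo' and "n't"; with replacement, it becomes 'will' and 'not'.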
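
One more cleaning step may be worth considering before replacement. Text copied from web pages or word processors often contains the typographic apostrophe (U+2019), which patterns written with the ASCII apostrophe do not match. The following is a minimal sketch, assuming that mapping U+2019 to ' is acceptable for your data; normalize_apostrophes is a hypothetical helper, not part of NLTK.

```python
import re

def normalize_apostrophes(text):
    """Map the typographic apostrophe (U+2019) to the ASCII apostrophe."""
    return text.replace("\u2019", "'")

raw = "I can\u2019t do it"

# Without normalization the pattern does not match and the text is unchanged.
untouched = re.sub(r"can't", "cannot", raw)
# After normalization the contraction is expanded.
expanded = re.sub(r"can't", "cannot", normalize_apostrophes(raw))
print(expanded)  # I cannot do it
```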

Removing repeated characters

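Do we strictly follow grammar in everyday language? No, we do not. For example, we sometimes write "Hiiiiiiiiiiii Mohan" to emphasize the word "Hi". A computer system, however, does not know that "Hiiiiiiiiiiii" is a variant of "Hi". In the example below, we create a class named Rep_word_removal that removes repeated characters from a word.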

Example

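First, import the re package to work with regular expressions, and wordnet from nltk.corpus to check whether a word exists.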

import re
from nltk.corpus import wordnet

Now, create a class that can be used to remove repeated characters:

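class Rep_word_removal(object):
   def __init__(self):
      # Matches any character that is immediately followed by itself.
      self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
      # Keeps only one of the two repeated characters.
      self.repl = r'\1\2\3'
   def replace(self, word):
      # Stop as soon as WordNet recognizes the word.
      if wordnet.synsets(word):
         return word
      repl_word = self.repeat_regexp.sub(self.repl, word)
      # Recurse until no repeated character can be removed.
      if repl_word != word:
         return self.replace(repl_word)
      else:
         return repl_word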

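Save this Python program (say, removalrepeat.py) and run it from the Python command prompt. After that, import the Rep_word_removal class whenever you want to remove repeated characters. Let us see how.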

from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii")
Output:
'Hi'
rep_word.replace("Hellooooooooooooooo")
Output:
'Hello'
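
To see why the result is 'Hello' rather than 'Helo', it helps to trace the recursion: each pass removes a single repeated character, and the dictionary check stops the process at the first known word. The sketch below shows this without NLTK. KNOWN_WORDS is a tiny hypothetical stand-in for the WordNet lookup, and remove_repeats is an illustrative function, not the class above.

```python
import re

KNOWN_WORDS = {"hi", "hello", "book"}  # hypothetical stand-in for WordNet

repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")

def remove_repeats(word, trace):
    """Drop one repeated character per call until the word is known."""
    trace.append(word)
    if word.lower() in KNOWN_WORDS:
        return word
    shorter = repeat_regexp.sub(r"\1\2\3", word)
    if shorter != word:
        return remove_repeats(shorter, trace)
    return shorter

steps = []
print(remove_repeats("Hellooo", steps))  # Hello
print(steps)                             # ['Hellooo', 'Helloo', 'Hello']
print(remove_repeats("book", []))        # book
```

Because the lookup runs before any character is removed, a word with a legitimate double letter, such as "book", is returned unchanged.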

Complete implementation example

import re
from nltk.corpus import wordnet
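class Rep_word_removal(object):
   def __init__(self):
      # Matches any character that is immediately followed by itself.
      self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
      # Keeps only one of the two repeated characters.
      self.repl = r'\1\2\3'
   def replace(self, word):
      # Stop as soon as WordNet recognizes the word.
      if wordnet.synsets(word):
         return word
      replace_word = self.repeat_regexp.sub(self.repl, word)
      # Recurse until no repeated character can be removed.
      if replace_word != word:
         return self.replace(replace_word)
      else:
         return replace_word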

Now, once the above program is saved and run, you can import the class and use it as follows:

from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii")

Output

'Hi'
