Evaluating Neural Machine Translation with the BLEU Score in Python

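In NLP, neural machine translation (NMT) translates text from a given source language into a target language. To evaluate the quality of a translation, we can compute the BLEU (Bilingual Evaluation Understudy) score in Python.

The BLEU score works by comparing the n-grams of a machine-translated sentence with the n-grams of one or more human-translated reference sentences. Scores also tend to fall as sentence length increases, since longer n-gram matches become harder to achieve. BLEU ranges from 0 to 1, and a higher value indicates better quality, although a perfect score is very rare. Note that the evaluation is based purely on matching n-grams and does not account for other aspects of language such as coherence, tense, and grammar.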
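To make the n-gram comparison concrete, here is a minimal sketch of the clipped ("modified") n-gram precision that BLEU builds on. The helper names `ngrams` and `modified_precision` and the sample tokens are chosen for illustration only; they are not part of any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # the candidate is too short to contain any n-gram
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = ["it", "rain", "today"]
references = [["it", "is", "raining", "today"],
              ["it", "was", "raining", "today"]]
print(modified_precision(candidate, references, 1))  # 2 of 3 unigrams match
print(modified_precision(candidate, references, 2))  # no bigram matches
```

Clipping is what stops a candidate from scoring well by repeating a single common word: each n-gram is counted only up to the number of times it appears in the best-matching reference.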

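Formula

BLEU = BP * exp(1/n * sum_{i=1}^{n} log(p_i))

Here, the terms have the following meanings:

BP is the brevity penalty. It adjusts the BLEU score based on the lengths of the two texts, penalizing machine translations that are shorter than the reference. With c as the length of the machine translation and r as the length of the reference, its formula is:

BP = min(1, exp(1 - (r / c)))

n is the maximum n-gram order used for matching
p_i is the precision score for n-grams of order i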
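The two formulas above can be sketched directly in Python. This is an illustrative implementation under the definitions given here, not the code of any particular library; the function names `brevity_penalty` and `bleu_from_precisions` are assumptions, and treating the score as 0 when any precision is 0 is a convention adopted to avoid log(0).

```python
import math

def brevity_penalty(c, r):
    """BP = min(1, exp(1 - r / c)) for candidate length c and reference length r."""
    if c == 0:
        return 0.0  # an empty translation gets no credit
    return min(1.0, math.exp(1 - r / c))

def bleu_from_precisions(precisions, c, r):
    """Combine the precisions p_1..p_n with the brevity penalty."""
    if min(precisions) <= 0:
        return 0.0  # log(0) is undefined, so the geometric mean is taken as 0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(c, r) * math.exp(log_mean)

# Illustrative values: four precisions, a 5-token translation, a 6-token reference.
print(brevity_penalty(5, 6))
print(bleu_from_precisions([1.0, 0.75, 2 / 3, 0.5], c=5, r=6))
```

The exp of the averaged logs is simply the geometric mean of the precisions, which is why a single zero precision drives the whole score to zero.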

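Algorithm

Step 1 - Import the datasets library.
Step 2 - Call the load_metric function with bleu as its argument.
Step 3 - Split the machine-translated string into a list of words.
Step 4 - Repeat Step 3 for each target (reference) string.
Step 5 - Compute the BLEU value with bleu.compute.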
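Steps 3 and 4 amount to turning plain strings into nested lists of tokens: one token list per translation, and one list of token lists per translation for the references. A minimal sketch, assuming simple whitespace tokenization (the helper `to_tokens` is hypothetical and a real pipeline would likely use a proper tokenizer):

```python
def to_tokens(sentence):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return sentence.lower().split()

machine_translation = "it rain today"
human_translations = ["it is raining today", "it was raining today"]

# One token list per machine translation.
predictions = [to_tokens(machine_translation)]
# For each machine translation, a list of reference token lists.
references = [[to_tokens(ref) for ref in human_translations]]

print(predictions)
print(references)
```

The extra level of nesting in `references` is what allows each translation to be scored against several acceptable reference translations at once.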

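Example 1

In this example, we use the load_metric function from the Hugging Face datasets library to compute the BLEU score for a German sentence that has been machine-translated into English.

Source text (German) - es regnet heute
Machine-translated text - it rain today
Target text - it is raining today, it was raining today

Although we can see that the translation is not correct, computing the BLEU score gives us a better picture of its quality.

Example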

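# Import the metric loader from the datasets library.
# Note: load_metric is deprecated in newer datasets releases in favor of the
# separate evaluate library; this example assumes a version that still provides it.
from datasets import load_metric

# Load the BLEU metric.
bleu = load_metric("bleu")

# The machine translation, as a list of tokens (one inner list per sentence).
predictions = [["it", "rain", "today"]]

# The reference translations for that sentence, each as a list of tokens.
references = [
    [["it", "is", "raining", "today"],
     ["it", "was", "raining", "today"]]
]

# Compute and print the BLEU score and its components.
print(bleu.compute(predictions=predictions, references=references))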

Output

{'bleu': 0.0, 'precisions': [0.6666666666666666, 0.0, 0.0, 0.0], 'brevity_penalty': 0.7165313105737893, 'length_ratio': 0.75, 'translation_length': 3, 'reference_length': 4}

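You can see that the translation is poor: two of its three unigrams match a reference, but no bigram or longer n-gram does, so the BLEU score is 0.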
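The zero can be traced by hand from the values in the output above. The sketch below plugs the reported precisions and lengths into the formulas from the Formula section; the unigram-only score at the end is added purely for comparison and is not something the original output reports.

```python
import math

precisions = [2 / 3, 0.0, 0.0, 0.0]  # 'precisions' from the output above
c, r = 3, 4                          # 'translation_length', 'reference_length'

# Brevity penalty: the translation is shorter than the reference.
bp = min(1.0, math.exp(1 - r / c))

# The geometric mean is zero as soon as any single precision is zero.
if min(precisions) == 0:
    geo_mean = 0.0
else:
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu = bp * geo_mean
bleu_unigram_only = bp * precisions[0]  # what a maximum order of 1 would give

print(bp)
print(bleu)
print(bleu_unigram_only)
```

So the score of 0 does not mean that nothing matched. It reflects how strictly the geometric mean treats a missing match at any n-gram order, which is most noticeable on very short sentences.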

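Example 2

In this example, we compute the BLEU score again. This time, we use a French sentence that has been machine-translated into English.

Source text (French) - nous partons en voyage
Machine-translated text - we going on a trip
Target text - we are going on a trip, we were going on a trip

You can see that this time the translated text is closer to the target text. Let us check its BLEU score.

Example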

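# Import the metric loader from the datasets library.
# Note: load_metric is deprecated in newer datasets releases in favor of the
# separate evaluate library; this example assumes a version that still provides it.
from datasets import load_metric

# Load the BLEU metric.
bleu = load_metric("bleu")

# The machine translation, as a list of tokens (one inner list per sentence).
predictions = [["we", "going", "on", "a", "trip"]]

# The reference translations for that sentence, each as a list of tokens.
references = [
    [["we", "are", "going", "on", "a", "trip"],
     ["we", "were", "going", "on", "a", "trip"]]
]

# Compute and print the BLEU score and its components.
print(bleu.compute(predictions=predictions, references=references))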

Output

{'bleu': 0.5789300674674098, 'precisions': [1.0, 0.75, 0.6666666666666666, 0.5], 'brevity_penalty': 0.8187307530779819, 'length_ratio': 0.8333333333333334, 'translation_length': 5, 'reference_length': 6}

You can see that this time the translation is very close to the target output, so the BLEU score is also above 0.5.

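Conclusion

The BLEU score is a useful tool for checking how well a translation model performs, so that the model can be improved further to produce better results. Although BLEU gives a rough picture of a model, it only rewards exact matches against the words in the reference translations and often misses the nuances of language. This is why BLEU scores can agree poorly with human judgment. There are, however, alternatives that you can try, such as the ROUGE score, the METEOR metric, and the CIDEr metric.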

Jaisshree

Updated on: August 7, 2023
