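OpenNLP - Sentence Detection

When processing natural language, one of the problems to solve is deciding where each sentence begins and ends. This process is known as Sentence Boundary Disambiguation (SBD), or simply sentence breaking.

The techniques we use to detect the sentences in a given text depend on the language of the text.

Sentence Detection Using Java

We can detect the sentences in a given text in Java using regular expressions and a set of simple rules.

For example, if we assume that a period, a question mark, or an exclamation mark ends a sentence in the given text, then we can split the text into sentences using the split() method of the String class. Here, we have to pass the regular expression as a String.

Following is the program that determines the sentences in a given text using a Java regular expression (the split() method). Save this program in a file named SentenceDetection_RE.java.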

public class SentenceDetection_RE {  
   public static void main(String args[]){ 
     
      String sentence = " Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
     
      String simple = "[.?!]";      
      String[] splitString = (sentence.split(simple));     
      for (String string : splitString)   
         System.out.println(string);      
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands.

javac SentenceDetection_RE.java 
java SentenceDetection_RE

On executing, the above program prints the following output. Note that split() discards the punctuation marks it splits on.

Hi 
How are you 
Welcome to Tutorialspoint 
We provide free tutorials on various technologies
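The split above drops the terminating punctuation. As a minimal sketch of one possible variation (the class name RegexSplitKeepingPunctuation and the lookbehind pattern are illustrative assumptions, not part of the original program), the text can be split on the whitespace that follows a terminator, so that each sentence keeps its punctuation mark. The second call shows the limitation of this kind of rule: an abbreviation such as "Dr." is treated as a sentence end, which is the sort of case a trained model is meant to handle better.

```java
class RegexSplitKeepingPunctuation {

    // Splits on whitespace that is preceded by '.', '?' or '!'.
    // The lookbehind (?<=...) matches without consuming the terminator,
    // so it stays attached to its sentence.
    static String[] split(String text) {
        return text.trim().split("(?<=[.?!])\\s+");
    }

    public static void main(String[] args) {
        String text = " Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";

        // Prints four sentences, the first three ending in their terminator
        for (String s : split(text))
            System.out.println(s);

        System.out.println();

        // A simple rule cannot tell an abbreviation from a sentence end:
        // "Dr." is printed as a sentence of its own
        for (String s : split("Dr. Smith arrived. He was late."))
            System.out.println(s);
    }
}
```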

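Sentence Detection Using OpenNLP

To detect sentences, OpenNLP uses a predefined model, a file named en-sent.bin. This predefined model is trained to detect sentences in a given raw text.

The opennlp.tools.sentdetect package contains the classes and interfaces that are used to perform the sentence detection task.

To detect sentences using the OpenNLP library, you need to:

Load the en-sent.bin model using the SentenceModel class.
Instantiate the SentenceDetectorME class.
Detect the sentences using the sentDetect() method of this class.

Following are the steps to write a program that detects the sentences in a given raw text.

Step 1: Loading the model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model:

Create an InputStream object of the model (instantiate FileInputStream and pass the path of the model, as a String, to its constructor).
Instantiate the SentenceModel class and pass the InputStream (object) of the model as a parameter to its constructor, as shown in the following code block: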

//Loading sentence detector model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
SentenceModel model = new SentenceModel(inputStream);
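A mistyped model file name is an easy mistake in this step. The following is a minimal sketch, using only the standard library, of opening the model file so that a wrong path fails early with a clear message and the stream is closed automatically. The class name ModelStreamSketch and the helper openModel() are hypothetical names introduced for illustration; the path is the one used in this tutorial and may differ on your machine.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

class ModelStreamSketch {

    // Hypothetical helper: opens the model file, reporting the absolute
    // path in the error message so a mistyped file name is easy to spot.
    static InputStream openModel(String path) throws FileNotFoundException {
        File file = new File(path);
        if (!file.isFile()) {
            throw new FileNotFoundException(
                "Model file not found: " + file.getAbsolutePath());
        }
        return new FileInputStream(file);
    }

    public static void main(String[] args) {
        // Assumed location; pass a different path as the first argument
        String path = args.length > 0 ? args[0] : "C:/OpenNLP_models/en-sent.bin";

        // try-with-resources closes the stream even if an exception is thrown
        try (InputStream inputStream = openModel(path)) {
            // With OpenNLP on the classpath, the SentenceModel would be
            // constructed from inputStream at this point.
            System.out.println("Opened model file: " + path);
        } catch (IOException e) {
            System.err.println(e.getMessage());
        }
    }
}
```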

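Step 2: Instantiating the SentenceDetectorME class

The SentenceDetectorME class of the package opennlp.tools.sentdetect contains methods to split raw text into sentences. This class uses a maximum entropy model to evaluate the end-of-sentence characters in a string and decide whether they actually mark the end of a sentence.

Instantiate this class and pass the model object created in the previous step to it, as shown below.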

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model);

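Step 3: Detecting the sentences

The sentDetect() method of the SentenceDetectorME class is used to detect the sentences in the raw text passed to it. This method accepts a String variable as a parameter.

Invoke this method by passing the raw text to it as a String.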

//Detecting the sentence 
String sentences[] = detector.sentDetect(sentence);

Example

Following is the program that detects the sentences in a given raw text. Save this program in a file named SentenceDetectionME.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel;  

public class SentenceDetectionME { 
  
   public static void main(String args[]) throws Exception { 
   
      String sentence = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
    
      //Detecting the sentence
      String sentences[] = detector.sentDetect(sentence); 
    
      //Printing the sentences 
      for(String sent : sentences)        
         System.out.println(sent);  
   } 
}

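Compile and execute the saved Java file from the command prompt using the following commands:

javac SentenceDetectionME.java 
java SentenceDetectionME

On executing, the above program reads the given string, detects the sentences in it, and displays the following output.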

Hi. How are you? 
Welcome to Tutorialspoint. 
We provide free tutorials on various technologies

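Detecting the Positions of the Sentences

We can also detect the positions of the sentences using the sentPosDetect() method of the SentenceDetectorME class.

Following are the steps to write a program that detects the positions of the sentences in a given raw text.

Step 1: Loading the model

The model for sentence detection is represented by the class named SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model:

Create an InputStream object of the model (instantiate FileInputStream and pass the path of the model, as a String, to its constructor).
Instantiate the SentenceModel class and pass the InputStream (object) of the model as a parameter to its constructor, as shown in the following code block.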

//Loading sentence detector model 
InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
SentenceModel model = new SentenceModel(inputStream);

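Step 2: Instantiating the SentenceDetectorME class

Instantiate the SentenceDetectorME class and pass the model object created in the previous step to it.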

//Instantiating the SentenceDetectorME class 
SentenceDetectorME detector = new SentenceDetectorME(model);

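Step 3: Detecting the positions of the sentences

The sentPosDetect() method of the SentenceDetectorME class is used to detect the positions of the sentences in the raw text passed to it. This method accepts a String variable as a parameter.

Invoke this method by passing the raw text to it as a String.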

//Detecting the position of the sentences in the paragraph  
Span[] spans = detector.sentPosDetect(sentence);

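Step 4: Printing the spans of the sentences

The sentPosDetect() method of the SentenceDetectorME class returns an array of objects of the type Span. The class named Span of the opennlp.tools.util package stores the start and end offsets (integers) of a span.

You can store the spans returned by the sentPosDetect() method in a Span array and print them, as shown in the following code block.

//Printing the spans of the sentences 
for (Span span : spans)         
   System.out.println(span);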

Example

Following is the program that detects the positions of the sentences in a given raw text. Save this program in a file named SentencePosDetection.java.

import java.io.FileInputStream; 
import java.io.InputStream; 
  
import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.util.Span;

public class SentencePosDetection { 
  
   public static void main(String args[]) throws Exception { 
   
      String paragraph = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
       
      //Detecting the position of the sentences in the raw text 
      Span spans[] = detector.sentPosDetect(paragraph); 
       
      //Printing the spans of the sentences in the paragraph 
      for (Span span : spans)         
         System.out.println(span);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentencePosDetection.java 
java SentencePosDetection

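On executing, the above program reads the given string, detects the positions of the sentences in it, and displays the following output.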

[0..16) 
[17..43) 
[44..93)
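Each span in this output reads as a half-open interval: the start offset is included and the end offset is excluded, which is why the gaps at offsets 16 and 43 (the spaces between sentences) belong to no span. The following sketch illustrates this arithmetic with plain int pairs standing in for Span objects, so it runs without OpenNLP; the class name SpanOffsetsDemo and the helper cover() are illustrative assumptions, and the offsets are copied from the output above.

```java
class SpanOffsetsDemo {

    // Returns the text covered by the half-open range [start, end):
    // the character at 'start' is included, the one at 'end' is not.
    static String cover(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String paragraph = "Hi. How are you? Welcome to Tutorialspoint. "
            + "We provide free tutorials on various technologies";

        // Offsets copied from the output above (stand-ins for Span objects)
        int[][] offsets = { {0, 16}, {17, 43}, {44, 93} };

        for (int[] o : offsets) {
            int length = o[1] - o[0];   // length of a half-open range
            System.out.println("[" + o[0] + ".." + o[1] + ") length "
                + length + ": " + cover(paragraph, o[0], o[1]));
        }

        // The characters at offsets 16 and 43 are the separating spaces
        System.out.println("char at 16 is a space: " + (paragraph.charAt(16) == ' '));
    }
}
```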

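Sentences Along with Their Positions

The substring() method of the String class accepts the begin and end offsets and returns the corresponding string. We can use this method to print the sentences together with their spans (positions), as shown in the following code block.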

for (Span span : spans)         
   System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);

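Following is the program that detects the sentences in a given raw text and displays them along with their positions. Save this program in a file named SentencesAndPosDetection.java.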

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel; 
import opennlp.tools.util.Span; 
   
public class SentencesAndPosDetection { 
  
   public static void main(String args[]) throws Exception { 
     
      String sen = "Hi. How are you? Welcome to Tutorialspoint." 
         + " We provide free tutorials on various technologies"; 
      //Loading a sentence model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin"); 
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class 
      SentenceDetectorME detector = new SentenceDetectorME(model);  
       
      //Detecting the position of the sentences in the paragraph  
      Span[] spans = detector.sentPosDetect(sen);  
      
      //Printing the sentences and their spans of a paragraph 
      for (Span span : spans)         
         System.out.println(sen.substring(span.getStart(), span.getEnd())+" "+ span);  
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentencesAndPosDetection.java 
java SentencesAndPosDetection

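On executing, the above program reads the given string, detects the sentences along with their positions, and displays the following output.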

Hi. How are you? [0..16) 
Welcome to Tutorialspoint. [17..43)  
We provide free tutorials on various technologies [44..93)

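Sentence Probability Detection

The getSentenceProbabilities() method of the SentenceDetectorME class returns the probabilities associated with the most recent call to the sentDetect() method.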

//Getting the probabilities of the last decoded sequence       
double[] probs = detector.getSentenceProbabilities();

Following is the program that prints the probabilities associated with the call to the sentDetect() method. Save this program in a file named SentenceDetectionMEProbs.java.

import java.io.FileInputStream; 
import java.io.InputStream;  

import opennlp.tools.sentdetect.SentenceDetectorME; 
import opennlp.tools.sentdetect.SentenceModel;  

public class SentenceDetectionMEProbs { 
  
   public static void main(String args[]) throws Exception { 
   
      String sentence = "Hi. How are you? Welcome to Tutorialspoint. " 
         + "We provide free tutorials on various technologies"; 
       
      //Loading sentence detector model 
      InputStream inputStream = new FileInputStream("C:/OpenNLP_models/en-sent.bin");
      SentenceModel model = new SentenceModel(inputStream); 
       
      //Instantiating the SentenceDetectorME class
      SentenceDetectorME detector = new SentenceDetectorME(model);  
      
      //Detecting the sentence 
      String sentences[] = detector.sentDetect(sentence); 
    
      //Printing the sentences 
      for(String sent : sentences)        
         System.out.println(sent);   
         
      //Getting the probabilities of the last decoded sequence       
      double[] probs = detector.getSentenceProbabilities(); 
       
      System.out.println("  "); 
       
      for(int i = 0; i<probs.length; i++) 
         System.out.println(probs[i]); 
   } 
}

Compile and execute the saved Java file from the command prompt using the following commands:

javac SentenceDetectionMEProbs.java 
java SentenceDetectionMEProbs

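On executing, the above program reads the given string, detects the sentences, and prints them. It then prints the probabilities associated with the most recent call to the sentDetect() method, as shown below.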

Hi. How are you? 
Welcome to Tutorialspoint. 
We provide free tutorials on various technologies 
   
0.9240246995179983 
0.9957680129995953 
1.0
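One possible use of these values is to flag the sentences whose boundary the model was less sure about. The sketch below works on plain arrays holding the sentences and probabilities printed above, so it runs without OpenNLP; the class name LowConfidenceSentences, the helper below(), and the threshold 0.95 are arbitrary assumptions chosen for illustration, not recommended settings.

```java
import java.util.ArrayList;
import java.util.List;

class LowConfidenceSentences {

    // Returns the sentences whose probability is below the threshold.
    // Assumes probs[i] is the probability reported for sentences[i].
    static List<String> below(String[] sentences, double[] probs, double threshold) {
        if (sentences.length != probs.length) {
            throw new IllegalArgumentException("one probability per sentence expected");
        }
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(sentences[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Values copied from the output above
        String[] sentences = {
            "Hi. How are you?",
            "Welcome to Tutorialspoint.",
            "We provide free tutorials on various technologies"
        };
        double[] probs = { 0.9240246995179983, 0.9957680129995953, 1.0 };

        // 0.95 is an arbitrary cut-off for this illustration
        for (String s : below(sentences, probs, 0.95))
            System.out.println("Low confidence: " + s);
    }
}
```

With these values, only the first entry ("Hi. How are you?") falls below the cut-off, which matches the observation that the model did not split it after "Hi.".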
