Sentence Segmentation with spaCy in Python

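Sentence segmentation is a crucial task in natural language processing (NLP). In this article, we look at how to perform it with spaCy, an efficient Python NLP library. Sentence segmentation splits a text into individual sentences, which lays the groundwork for many NLP applications. We cover two approaches: rule-based segmentation using spaCy's pretrained model, and machine-learning-based segmentation using custom training. spaCy also lets you write your own segmentation rules. Together, these options give developers a flexible and efficient way to segment sentences in Python-based NLP projects.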

Why use spaCy for sentence segmentation

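Efficient and fast − spaCy is known for its speed and efficiency. It was designed with performance in mind and uses optimized algorithms, which makes it well suited to processing large volumes of text.
Pretrained models − spaCy provides pretrained models for many languages, including English, with sentence segmentation available out of the box. These models are trained on large corpora and are regularly updated, which saves you the effort of training your own model from scratch.
Accuracy − spaCy's pretrained models and language rules identify sentence boundaries from punctuation, capitalization, and other language-specific signals. This gives reliable results even when a sentence boundary is not marked by a period.
Customizability − spaCy lets you customize and fine-tune the segmentation process for your specific needs. You can train your own machine learning model on labeled data, or write custom rules to handle special cases or domain-specific requirements.
Simple integration − Sentence segmentation underpins many NLP tasks, such as part-of-speech tagging, named entity recognition, and sentiment analysis, and in spaCy it runs inside the same pipeline that provides tagging and entity recognition.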
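As a rough illustration of the customizability point above, the sketch below adds one hand-written rule on top of spaCy's punctuation-based `sentencizer`. It assumes spaCy v3; the component name `semicolon_boundaries`, the rule itself (treat a semicolon as a sentence boundary), and the sample text are made up for this example and are not part of spaCy.

```python
import spacy
from spacy.language import Language


@Language.component("semicolon_boundaries")  # hypothetical component name
def semicolon_boundaries(doc):
    """Mark the token after each semicolon as the start of a new sentence."""
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc


# A blank English pipeline only contains a tokenizer, so no model download is needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")           # default punctuation-based boundaries
nlp.add_pipe("semicolon_boundaries")  # custom rule applied afterwards

doc = nlp("The model loaded; the parser ran. Results were saved.")
custom_sentences = [sent.text for sent in doc.sents]
print(custom_sentences)
```

The custom component runs after the `sentencizer` here, so it only adds boundaries on top of the defaults. In a pipeline that contains a dependency parser, a rule like this would have to run before the parser, because spaCy does not allow sentence starts to be changed once a parse has been set.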

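Approach 1: Rule-based sentence segmentation

Algorithm

The first approach uses the sentence segmentation that is built into spaCy.
spaCy provides a pretrained English pipeline named "en_core_web_sm", which includes a default sentence segmenter.
It determines sentence boundaries from punctuation and other language-specific cues. Strictly speaking, this pipeline takes its boundaries from a statistical dependency parser rather than from hand-written rules; spaCy's purely rule-based alternative is the `sentencizer` component.

Example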

#pip install spacy
#python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

text = "This is the first sentence. This is the second sentence. And this is the third sentence."

doc = nlp(text)

sentences = [sent.text for sent in doc.sents]

for sentence in sentences:
    print(sentence)

Output

This is the first sentence.
This is the second sentence.
And this is the third sentence.
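For comparison, here is a minimal sketch of segmentation that relies on punctuation rules only. It assumes spaCy v3 and uses the built-in `sentencizer` component on a blank English pipeline, so no pretrained model has to be downloaded. The variable name `rule_based_sentences` is chosen for this example.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
# The sentencizer splits on sentence-final punctuation such as ".", "!" and "?".
nlp.add_pipe("sentencizer")

text = "This is the first sentence. This is the second sentence. And this is the third sentence."
doc = nlp(text)

rule_based_sentences = [sent.text for sent in doc.sents]
for sentence in rule_based_sentences:
    print(sentence)
```

On this simple text the result matches the output shown above. The two can differ on harder input, such as abbreviations or missing punctuation, where the parser-based segmenter has more context to work with.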

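Approach 2: Machine-learning-based sentence segmentation

Algorithm

The second approach is machine-learning-based sentence segmentation with spaCy.
spaCy lets you train your own custom sentence segmenter on labeled data.
To train one, we need a corpus of text whose sentence boundaries have been annotated by hand.
Each sentence in the corpus should be marked with its start and end offsets. In spaCy v3, these boundaries are passed to training as a per-token flag marking which tokens start a sentence.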
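When the gold sentences are already known, their start and end character offsets can be derived mechanically instead of being typed by hand. The helper below is a small sketch in plain Python; the function name `sentence_offsets` is invented for this example, and it assumes each sentence appears verbatim in the text, in order.

```python
def sentence_offsets(text, sentences):
    """Return the (start, end) character offsets of each sentence within text."""
    offsets = []
    cursor = 0  # search position, so repeated sentences are matched in order
    for sentence in sentences:
        start = text.find(sentence, cursor)
        if start == -1:
            raise ValueError(f"sentence not found in text: {sentence!r}")
        end = start + len(sentence)
        offsets.append((start, end))
        cursor = end
    return offsets


text = "This is the first sentence. This is the second sentence. And this is the third sentence."
sentences = [
    "This is the first sentence.",
    "This is the second sentence.",
    "And this is the third sentence.",
]

offsets = sentence_offsets(text, sentences)
print(offsets)  # [(0, 27), (28, 56), (57, 88)]
```

The end offsets are exclusive, so `text[start:end]` returns the sentence itself.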

Example

import spacy
from spacy.training import Example

# This version uses the spaCy v3 training API. Sentence boundaries are learned by
# the trainable "senter" component (SentenceRecognizer), not by the entity recognizer.
nlp = spacy.blank("en")
nlp.add_pipe("senter")

text = "This is the first sentence. This is the second sentence. And this is the third sentence."

sentences = ["This is the first sentence.", "This is the second sentence.", "And this is the third sentence."]

# Gold annotation: one flag per token, True where a sentence starts.
sent_starts = []
for sentence in sentences:
    n_tokens = len(nlp.make_doc(sentence))
    sent_starts.extend([True] + [False] * (n_tokens - 1))

train_data = [Example.from_dict(nlp.make_doc(text), {"sent_starts": sent_starts})]

# Initialize the model weights from the training examples.
optimizer = nlp.initialize(lambda: train_data)

# Toy training loop; a real segmenter needs far more data than one example.
for i in range(50):
    losses = {}
    nlp.update(train_data, sgd=optimizer, losses=losses)
    print(losses)

doc = nlp(text)

sentences = [sent.text for sent in doc.sents]

for sentence in sentences:
    print(sentence)

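Output

The script first prints one loss dictionary per training iteration (omitted here), followed by the predicted sentences. A model trained on a single toy example may need more data or more iterations before it reproduces these boundaries:

This is the first sentence.
This is the second sentence.
And this is the third sentence.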

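Conclusion

In this article, we looked at two different ways to perform sentence segmentation with spaCy in Python. We started with the sentence segmenter built into spaCy's pretrained pipeline, which offers a convenient way to split sentences based on punctuation and language-specific cues. We then looked at a machine-learning-based approach, in which a custom sentence segmenter is trained on labeled data. Each approach has its own strengths and can be applied according to the needs of your NLP project. Whether you need a simple built-in segmenter or a more advanced machine-learning-based solution, spaCy provides the flexibility and control to handle sentence segmentation effectively.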

Pranavnath

Updated on: 1 September 2023
