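Synsets for a Word in WordNet in NLP
Introduction
WordNet is a large lexical database that is available through the NLTK library. WordNets exist for several languages and are used in many natural-language applications. NLTK provides an interface called Synset that lets us look up words in WordNet. Nouns, verbs, and other word classes are grouped into sets of synonyms called synsets.
WordNet and Synsets
The figure below shows the structure of WordNet.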

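WordNet preserves the relationships between words. For example, "sad" and words like it have similar meanings and are used in similar contexts, so they can be substituted for one another. Such words are grouped into a synset. Each synset has its own meaning and is linked to other synsets through conceptual relations.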
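To make the grouping concrete, here is a minimal sketch in plain Python that does not use NLTK. The synset identifiers and word groups in `TOY_SYNSETS` are invented for illustration and are not real WordNet entries; the helper names `build_index` and `synonyms` are likewise assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical miniature "WordNet": each synset id maps to words that
# can stand in for one another. These entries are made up.
TOY_SYNSETS = {
    "sad.a.01": ["sad", "unhappy", "sorrowful"],
    "happy.a.01": ["happy", "glad", "joyful"],
    "bank.n.01": ["bank", "riverbank"],
    "bank.n.02": ["bank", "depository"],
}

def build_index(synsets):
    """Map every word to the ids of all synsets that contain it."""
    index = defaultdict(list)
    for synset_id, words in synsets.items():
        for word in words:
            index[word].append(synset_id)
    return dict(index)

def synonyms(word, synsets, index):
    """Collect the other words that share at least one synset with `word`."""
    found = set()
    for synset_id in index.get(word, []):
        found.update(synsets[synset_id])
    found.discard(word)
    return sorted(found)

WORD_INDEX = build_index(TOY_SYNSETS)
print(WORD_INDEX["bank"])                         # one word, two synsets
print(synonyms("sad", TOY_SYNSETS, WORD_INDEX))   # interchangeable words
```

A word with several meanings, such as "bank" here, belongs to several synsets, which is why a WordNet lookup for one word returns a list.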
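Two relations found in WordNet are hypernyms and hyponyms.
Hypernym: a hypernym is a semantically more abstract term. For example, if we look at the relationship between color and its kinds (blue, green, yellow, and so on), then "color" is the hypernym.
Hyponym: in the color example above, the individual colors such as yellow and green are hyponyms; they are more specific.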
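The color example can be sketched as a small lookup table in plain Python. This is only an illustration of the two relations, not how WordNet stores them; the table `HYPERNYM_OF` and the functions `hypernym` and `hyponyms` are hypothetical names chosen for this sketch.

```python
# Hypothetical toy taxonomy: each term maps to its direct hypernym.
# None marks a term with nothing more abstract above it.
HYPERNYM_OF = {
    "color": None,
    "blue": "color",
    "green": "color",
    "yellow": "color",
}

def hypernym(term):
    """Return the more abstract term directly above `term`, or None."""
    return HYPERNYM_OF.get(term)

def hyponyms(term):
    """Return the more specific terms directly below `term`."""
    return sorted(child for child, parent in HYPERNYM_OF.items() if parent == term)

print(hypernym("yellow"))   # the more abstract term
print(hyponyms("color"))    # the more specific terms
```

The two relations are inverses of each other: if "color" is the hypernym of "yellow", then "yellow" is a hyponym of "color".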

Code Implementation
import nltk
nltk.download('wordnet')  # fetch the WordNet corpus on first run
from nltk.corpus import wordnet
# Take the first synset returned for "book"
synset = wordnet.synsets('book')[0]
print("Name of the synset", synset.name())
print("Synset meaning :", synset.definition())
print("Synset example :", synset.examples())
# Hypernyms are the more abstract synsets directly above this one
print("Abstract terminology", synset.hypernyms())
# Hyponyms of the first hypernym are the more specific synsets below it
print("Specific terminology :", synset.hypernyms()[0].hyponyms())
# Root hypernyms are the topmost synsets reached by following hypernyms
print("Root hypernym :", synset.root_hypernyms())
Output
Name of the synset book.n.02
Synset meaning : physical objects consisting of a number of pages bound together
Synset example : ['he used a large book as a doorstop']
Abstract terminology [Synset('publication.n.01')]
Specific terminology : [Synset('book.n.01'), Synset('collection.n.02'), Synset('impression.n.06'), Synset('magazine.n.01'), Synset('new_edition.n.01'), Synset('periodical.n.01'), Synset('read.n.01'), Synset('reference.n.08'), Synset('reissue.n.01'), Synset('republication.n.01'), Synset('tip_sheet.n.01'), Synset('volume.n.04')]
Root hypernym : [Synset('entity.n.01')]
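The root shown in the last line of the output is what you reach by following hypernyms upward until none are left. The sketch below illustrates that walk. It relies only on the `name()` and `hypernyms()` methods, which NLTK synsets provide, but to stay runnable without the WordNet corpus it uses a hypothetical stand-in class, `ToySynset`, and a shortened three-step chain; the real path from a book synset to the root passes through more intermediate synsets.

```python
class ToySynset:
    """Hypothetical stand-in exposing the two synset methods used below."""

    def __init__(self, name, parents=()):
        self._name = name
        self._parents = list(parents)

    def name(self):
        return self._name

    def hypernyms(self):
        return self._parents

def hypernym_chain(synset):
    """Follow the first hypernym repeatedly until a synset has none."""
    chain = [synset.name()]
    current = synset
    while current.hypernyms():
        current = current.hypernyms()[0]
        chain.append(current.name())
    return chain

# Shortened, made-up chain for illustration only.
entity = ToySynset("entity")
publication = ToySynset("publication", [entity])
book = ToySynset("book", [publication])

print(" -> ".join(hypernym_chain(book)))
```

Because the function only calls `name()` and `hypernyms()`, it should also accept a synset obtained from `wordnet.synsets(...)`, although a synset with several hypernyms has more than one path and this sketch follows only the first.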
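Using the Pattern Library
# Notebook shell command; from a terminal run: pip install pattern
!pip install pattern
from pattern.en import parse, singularize, pluralize
from pattern.en import pprint
# Tag, chunk, and lemmatize the sentence, then print the result as a table
pprint(parse("Jack and Jill went up the hill to fetch a bucket of water", relations=True, lemmata=True))
# Inflect nouns between singular and plural forms
print("Plural of cat :", pluralize('cat'))
print("Singular of leaves :", singularize('leaves'))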
Output
WORD     TAG    CHUNK    ROLE   ID   PNP   LEMMA
Jack     NNP    NP       SBJ    1    -     jack
and      CC     NP ^     SBJ    1    -     and
Jill     NNP    NP ^     SBJ    1    -     jill
went     VBD    VP       -      1    -     go
up       IN     PP       -      -    PNP   up
the      DT     NP       SBJ    2    PNP   the
hill     NN     NP ^     SBJ    2    PNP   hill
to       TO     VP       -      2    -     to
fetch    VB     VP ^     -      2    -     fetch
a        DT     NP       OBJ    2    -     a
bucket   NN     NP ^     OBJ    2    -     bucket
of       IN     PP       -      -    PNP   of
water    NN     NP       -      -    PNP   water
Plural of cat : cats
Singular of leaves : leaf
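Using the WordNet Interface in spaCy
# Notebook shell command; from a terminal run: pip install spacy-wordnet
!pip install spacy-wordnet
import spacy
import nltk
nltk.download('wordnet')
# This import makes the "spacy_wordnet" component available to add_pipe
from spacy_wordnet.wordnet_annotator import WordnetAnnotator
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("spacy_wordnet", after='tagger')
# WordNet data is exposed on each token through the `_.wordnet` extension
spacy_token = nlp('leaves')[0]
print("Synsets :", spacy_token._.wordnet.synsets())
print("Lemmas :", spacy_token._.wordnet.lemmas())
print("Wordnet domains:", spacy_token._.wordnet.wordnet_domains())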
Output
Synsets : [Synset('leave.v.01'), Synset('leave.v.02'), Synset('leave.v.03'), Synset('leave.v.04'), Synset('exit.v.01'), Synset('leave.v.06'), Synset('leave.v.07'), Synset('leave.v.08'), Synset('entrust.v.02'), Synset('bequeath.v.01'), Synset('leave.v.11'), Synset('leave.v.12'), Synset('impart.v.01'), Synset('forget.v.04')]
Lemmas : [Lemma('leaf.n.01.leaf'), Lemma('leaf.n.01.leafage'), Lemma('leaf.n.01.foliage'), Lemma('leaf.n.02.leaf'), Lemma('leaf.n.02.folio'), Lemma('leaf.n.03.leaf'), Lemma('leave.n.01.leave'), Lemma('leave.n.01.leave_of_absence'), Lemma('leave.n.02.leave'), Lemma('farewell.n.02.farewell'), Lemma('farewell.n.02.leave'), Lemma('farewell.n.02.leave-taking'), Lemma('farewell.n.02.parting'), Lemma('leave.v.01.leave'), Lemma('leave.v.01.go_forth'), Lemma('leave.v.01.go_away'), Lemma('leave.v.02.leave'), Lemma('leave.v.03.leave'), Lemma('leave.v.04.leave'), Lemma('leave.v.04.leave_alone'), Lemma('leave.v.04.leave_behind'),
Wordnet domains: ['diplomacy', 'book_keeping', 'administration', 'factotum', 'agriculture', 'electrotechnology', 'person', 'telephony', 'mechanics']
NLTK WordNet Lemmatizer
from nltk.stem import WordNetLemmatizer
nltk_lemmatizer = WordNetLemmatizer()
# With no pos argument, the word is lemmatized as a noun
print("books :", nltk_lemmatizer.lemmatize("books"))
print("formulae :", nltk_lemmatizer.lemmatize("formulae"))
# pos="a" tells the lemmatizer to treat the word as an adjective
print("worse :", nltk_lemmatizer.lemmatize("worse", pos="a"))
Output
books : book
formulae : formula
worse : bad
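The "worse" example only gives "bad" because the part of speech is passed explicitly; without it the lemmatizer assumes a noun. When words come with Penn Treebank tags such as the NN and VBD tags in the Pattern output earlier, a small mapping to WordNet's part-of-speech letters ("n", "v", "a", "r") is a common way to supply that argument. The function below is a sketch of that idea; the name `treebank_to_wordnet_pos` is hypothetical, and falling back to noun for unknown tags is an assumption that mirrors the lemmatizer's default.

```python
def treebank_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, ...: verbs
        return "v"
    if tag.startswith("R"):   # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # nouns and everything else

for tag in ("JJR", "VBD", "NN", "RB"):
    print(tag, "->", treebank_to_wordnet_pos(tag))
```

The returned letter can be passed as the `pos` argument of `lemmatize`, for example "v" for a word tagged VBD.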
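Conclusion
Synsets are the interface for looking up words in WordNet. They are a useful way to find new words and the relations between them, because similar words are linked to one another in WordNet and form a dense network.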