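Python - Tagging Words
Tagging is a fundamental feature of text processing in which we label each word with its grammatical category (part of speech). We first tokenize the text and then use the pos_tag function to assign a tag to each token.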
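import nltk
# Requires the NLTK tokenizer and tagger data to be installed (see nltk.download()).
# Split the sentence into word tokens.
text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
# Assign a part-of-speech tag to each token.
tagged_text = nltk.pos_tag(text)
print(tagged_text)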
When we run the above program, we get the following output:
[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'),
('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'),
('the', 'DT'), ('nest', 'JJS')]
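Because pos_tag returns a plain list of (word, tag) tuples, ordinary list operations are enough to work with the result. The following is a minimal sketch that pulls out the nouns and verbs. The helper name words_with_tag_prefix is hypothetical, and the tagged list is copied from the output above rather than recomputed, so the block runs without NLTK installed.

```python
# Tagged output copied from the example above (not recomputed here).
tagged_text = [
    ('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
    ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'),
    ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS'),
]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix (hypothetical helper)."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


# Penn Treebank noun tags all begin with 'NN', verb tags with 'VB'.
nouns = words_with_tag_prefix(tagged_text, 'NN')
verbs = words_with_tag_prefix(tagged_text, 'VB')
print(nouns)
print(verbs)
```

Note that 'nest' does not appear among the nouns, because the tagger labelled it JJS in the output above. Filtering by tag is only as reliable as the tags themselves.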
Tag Descriptions
The following program describes what each tag means by displaying the descriptions built into NLTK.
import nltk
nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')
When we run the above program, we get the following output:
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist ...
IN: preposition or conjunction, subordinating
astride among uppon whether out inside pro despite on by throughout
below within for towards near behind atop around if like until below
next into if beside ...
DT: determiner
all an another any both del each either every half la many much nary
neither no some such that the them these this those
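When the descriptions are needed inside a program rather than printed to the console, one option is a small lookup table. The sketch below is an assumption-laden illustration, not part of NLTK: TAG_DESCRIPTIONS and describe are hypothetical names, the NN, IN and DT entries are taken from the output above, and the remaining entries are paraphrased Penn Treebank meanings for the tags seen in the first example.

```python
# Hand-written lookup table (hypothetical, not an NLTK API).
# NN, IN and DT come from the output above; the rest are paraphrased
# Penn Treebank meanings for the tags that appear in the first example.
TAG_DESCRIPTIONS = {
    'DT': 'determiner',
    'IN': 'preposition or conjunction, subordinating',
    'NN': 'noun, common, singular or mass',
    'NNS': 'noun, common, plural',
    'NNP': 'noun, proper, singular',
    'VBZ': 'verb, present tense, 3rd person singular',
    'WDT': 'WH-determiner',
    'JJS': 'adjective, superlative',
}


def describe(tag):
    """Return a short description for a tag, or a fallback if it is not in the table."""
    return TAG_DESCRIPTIONS.get(tag, 'unknown tag')


# Annotate a few (word, tag) pairs with their descriptions.
for word, tag in [('serpent', 'NN'), ('eats', 'VBZ'), ('from', 'IN')]:
    print(f"{word}: {tag} ({describe(tag)})")
```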
Tagging a Corpus
We can also tag corpus data and view the tagging result for each word in that corpus.
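import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

# Load the raw text of Blake's poems from the Gutenberg corpus.
sample = gutenberg.raw("blake-poems.txt")
# Split the text into sentences.
tokenized = sent_tokenize(sample)
# Tokenize and tag the first two sentences.
for i in tokenized[:2]:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    print(tagged)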
When we run the above program, we get the following output:
[('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'), (']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'), ('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'), ('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'), ('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'), (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'), (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'), ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'), ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]
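Once a corpus has been tagged, a common next step is to see how often each tag occurs. The following is a minimal sketch using collections.Counter from the standard library. The tagged_excerpt list is a short slice copied from the output above, used here only as sample data so the block runs without NLTK or the corpus installed.

```python
from collections import Counter

# A short slice of the tagged output above, used as sample data.
tagged_excerpt = [
    ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'),
    ('saw', 'VBD'), ('a', 'DT'), ('child', 'NN'), (',', ','),
]

# Count how many times each tag appears, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in tagged_excerpt)

# Print the tags from most to least frequent.
for tag, count in tag_counts.most_common():
    print(tag, count)
```

The same Counter can be updated inside the tagging loop, for example with tag_counts.update(tag for _, tag in tagged), to accumulate counts across many sentences.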