
- Python - 文字處理
- Python - 文字處理簡介
- Python - 文字處理環境
- Python - 字串不變性
- Python - 對行排序
- Python - 段落重新格式化
- Python - 段落中詞語計數
- Python - 二進位制ASCII轉換
- Python - 字串作為檔案
- Python - 反向檔案讀取
- Python - 過濾重複單詞
- Python - 從文字中提取電子郵件
- Python - 從文字中提取URL
- Python - 美化列印
- Python - 文字處理狀態機
- Python - 大寫和翻譯
- Python - 分詞
- Python - 刪除停用詞
- Python - 同義詞和反義詞
- Python - 文字翻譯
- Python - 替換單詞
- Python - 拼寫檢查
- Python - WordNet 介面
- Python - 語料庫訪問
- Python - 詞性標註
- Python - 組塊和組塊間隙
- Python - 組塊分類
- Python - 文字分類
- Python - 二元語法
- Python - 處理PDF
- Python - 處理Word文件
- Python - 讀取RSS Feed
- Python - 情感分析
- Python - 搜尋和匹配
- Python - 文字轉換
- Python - 文字換行
- Python - 頻率分佈
- Python - 文字摘要
- Python - 詞幹提取演算法
- Python - 受約束的搜尋
Python - 段落中詞語計數
在從源讀取文字時,有時我們也需要找出有關所用單詞型別的一些統計資訊。這使得有必要統計單詞的數量以及給定文字中具有特定型別單詞的行數。在下面的示例中,我們展示了使用兩種不同方法來統計段落中單詞數量的程式。為此,我們考慮了一個文字檔案,其中包含好萊塢電影的摘要。
讀取檔案
FileName = ("Path\GodFather.txt") with open(FileName, 'r') as file: lines_in_file = file.read() print lines_in_file
當我們執行上述程式時,我們得到以下輸出:
Vito Corleone is the aging don (head) of the Corleone Mafia Family. His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi. All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money. He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer. This does not please Sollozzo, who has the Don shot down by some of his hit men. The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.
使用nltk計數單詞
接下來,我們使用nltk模組來統計文字中的單詞。請注意,單詞“(head)”被計為3個單詞,而不是一個。
import nltk FileName = ("Path\GodFather.txt") with open(FileName, 'r') as file: lines_in_file = file.read() nltk_tokens = nltk.word_tokenize(lines_in_file) print nltk_tokens print "\n" print "Number of Words: " , len(nltk_tokens)
當我們執行上述程式時,我們得到以下輸出:
['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(', 'head', ')', 'of', 'the', 'Corleone', 'Mafia', 'Family', '.', 'His', 'youngest', 'son', 'Michael', 'has', 'returned', 'from', 'WWII', 'just', 'in', 'time', 'to', 'see', 'the', 'wedding', 'of', 'Connie', 'Corleone', '(', 'Michael', "'s", 'sister', ')', 'to', 'Carlo', 'Rizzi', '.', 'All', 'of', 'Michael', "'s", 'family', 'is', 'involved', 'with', 'the', 'Mafia', ',', 'but', 'Michael', 'just', 'wants', 'to', 'live', 'a', 'normal', 'life', '.', 'Drug', 'dealer', 'Virgil', 'Sollozzo', 'is', 'looking', 'for', 'Mafia', 'families', 'to', 'offer', 'him', 'protection', 'in', 'exchange', 'for', 'a', 'profit', 'of', 'the', 'drug', 'money', '.', 'He', 'approaches', 'Don', 'Corleone', 'about', 'it', ',', 'but', ',', 'much', 'against', 'the', 'advice', 'of', 'the', 'Don', "'s", 'lawyer', 'Tom', 'Hagen', ',', 'the', 'Don', 'is', 'morally', 'against', 'the', 'use', 'of', 'drugs', ',', 'and', 'turns', 'down', 'the', 'offer', '.', 'This', 'does', 'not', 'please', 'Sollozzo', ',', 'who', 'has', 'the', 'Don', 'shot', 'down', 'by', 'some', 'of', 'his', 'hit', 'men', '.', 'The', 'Don', 'barely', 'survives', ',', 'which', 'leads', 'his', 'son', 'Michael', 'to', 'begin', 'a', 'violent', 'mob', 'war', 'against', 'Sollozzo', 'and', 'tears', 'the', 'Corleone', 'family', 'apart', '.'] Number of Words: 167
使用split計數單詞
接下來,我們使用split函式來統計單詞,在這裡,單詞“(head)”被計為一個單詞,而不是像使用nltk那樣計為3個單詞。
FileName = ("Path\GodFather.txt") with open(FileName, 'r') as file: lines_in_file = file.read() print lines_in_file.split() print "\n" print "Number of Words: ", len(lines_in_file.split())
當我們執行上述程式時,我們得到以下輸出:
['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(head)', 'of', 'the', 'Corleone', 'Mafia', 'Family.', 'His', 'youngest', 'son', 'Michael', 'has', 'returned', 'from', 'WWII', 'just', 'in', 'time', 'to', 'see', 'the', 'wedding', 'of', 'Connie', 'Corleone', "(Michael's", 'sister)', 'to', 'Carlo', 'Rizzi.', 'All', 'of', "Michael's", 'family', 'is', 'involved', 'with', 'the', 'Mafia,', 'but', 'Michael', 'just', 'wants', 'to', 'live', 'a', 'normal', 'life.', 'Drug', 'dealer', 'Virgil', 'Sollozzo', 'is', 'looking', 'for', 'Mafia', 'families', 'to', 'offer', 'him', 'protection', 'in', 'exchange', 'for', 'a', 'profit', 'of', 'the', 'drug', 'money.', 'He', 'approaches', 'Don', 'Corleone', 'about', 'it,', 'but,', 'much', 'against', 'the', 'advice', 'of', 'the', "Don's", 'lawyer', 'Tom', 'Hagen,', 'the', 'Don', 'is', 'morally', 'against', 'the', 'use', 'of', 'drugs,', 'and', 'turns', 'down', 'the', 'offer.', 'This', 'does', 'not', 'please', 'Sollozzo,', 'who', 'has', 'the', 'Don', 'shot', 'down', 'by', 'some', 'of', 'his', 'hit', 'men.', 'The', 'Don', 'barely', 'survives,', 'which', 'leads', 'his', 'son', 'Michael', 'to', 'begin', 'a', 'violent', 'mob', 'war', 'against', 'Sollozzo', 'and', 'tears', 'the', 'Corleone', 'family', 'apart.'] Number of Words: 146
廣告