Python Program to Find the Number of Unique Words in a Text File

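The task in this article is to find the number of unique words in a text file. Two examples show how to find the unique words in a text file and count them. In the first example, the words are read from the text file and converted to a set, which keeps only the unique words, before they are counted. In the second example, a list of words is created and sorted, the duplicates are removed from the sorted list, and the remaining unique words are counted to give the final result.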

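Preprocessing algorithm

Step 1 - Sign in with a Google account. Go to Google Colab, open a new Colab notebook, and write the Python code in it.

Step 2 - Upload the txt file “file1.txt” to Google Colab.

Step 3 - Open the txt file for reading.

Step 4 - Convert the content of the text file to lowercase.

Step 5 - Use the split function to separate the words in the txt file.

Step 6 - Print the list named “words_in_file”, which contains the words from the text file.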
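Steps 3 to 6 can be sketched as a small helper. This is a minimal sketch rather than the article's own listing: the function name read_words and the temporary sample file are assumptions made so that the snippet runs without file1.txt being present.

```python
import os
import tempfile


def read_words(path):
    """Return the lowercase, whitespace-separated words of a text file."""
    # The with-statement closes the file automatically after reading.
    with open(path, "r", encoding="utf-8") as handle:
        text = handle.read().lower()
    # split() with no arguments splits on any run of whitespace.
    return text.split()


# Hypothetical sample file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("This is a new file.\nThis is made for testing purposes only.\n")
    words_in_file = read_words(sample_path)

print(words_in_file)
```

In a Colab notebook, the same helper could be called with "file1.txt" once the file has been uploaded.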

The text file used for these examples

The content of file1.txt is shown below.

This is a new file.
This is made for testing purposes only.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
Oh! No.. there are seven lines now.

Uploading file1.txt to Colab

Figure: Uploading file1.txt in Google Colab

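Method 1: Using a Python set to find the number of unique words in a text file

After the preprocessing steps, Method 1 uses the following steps.

Step 1 - Start with the list “words_in_file” produced by the preprocessing steps.

Step 2 - Convert this list to a set. The set contains only the unique words.

Step 3 - Use a print statement to display the set of unique words.

Step 4 - Find the length of the set.

Step 5 - Print the length of the set.

Step 6 - This gives the number of unique words in the given text file.

Example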

# Open the text file in read mode; the with-statement closes it automatically
with open("file1.txt", 'r') as file:
    # Convert its content to lowercase
    thegiventxtfile = file.read().lower()

# Split the content into a list of words
words_in_file = thegiventxtfile.split()

print("The given txt file content is :\n")
print(thegiventxtfile)
print("\nThe words given in the txt file are :\n")
print(words_in_file)
print("\nThe unique words given in this txt file are :\n")

# Convert the list to a Python set, which keeps only the unique words
uniqueWords = set(words_in_file)

print(uniqueWords)

# Find the number of words in this set
numberofuniquewords = len(uniqueWords)

print("\nThe number of unique words given in this txt file are :\n")
print(numberofuniquewords)

Output

The given txt file content is :

this is a new file.
this is made for testing purposes only.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
oh! no.. there are seven lines now.


The words given in the txt file are :

['this', 'is', 'a', 'new', 'file.', 'this', 'is', 'made', 'for', 'testing', 'purposes', 'only.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'oh!', 'no..', 'there', 'are', 'seven', 'lines', 'now.']

The unique words given in this txt file are :

{'there', 'only.', 'testing', 'new', 'is', 'for', 'oh!', 'this', 'a', 'made', 'seven', 'are', 'purposes', 'in', 'file.', 'four', 'now.', 'no..', 'lines'}

The number of unique words given in this txt file are :

19
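The output above shows that split() leaves punctuation attached, so tokens such as 'file.' and 'no..' are counted exactly as they appear. If a word should be counted the same with and without surrounding punctuation, one possible refinement, which is not part of the method in this article, is to strip punctuation before building the set. The helper name normalize and the sample list below are hypothetical.

```python
import string


def normalize(word):
    """Strip leading and trailing ASCII punctuation from a word."""
    return word.strip(string.punctuation)


# Hypothetical word list in which 'file' appears with and without a period.
words_in_file = ['this', 'file', 'file.', 'oh!', 'no..', 'this']

# Without normalization, 'file' and 'file.' count as two different words.
raw_unique = set(words_in_file)

# With normalization they collapse into one; empty results are skipped.
clean_unique = {normalize(w) for w in words_in_file if normalize(w)}

print(len(raw_unique))
print(sorted(clean_unique))
print(len(clean_unique))
```

Whether punctuation should be stripped depends on what counts as a "word" for the task at hand.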

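Method 2: Using a Python dictionary to find the number of unique words in a text file

Step 1 - Open the required file and build the list “words_in_file” as in the preprocessing steps.

Step 2 - Sort this list and print it. The alphabetically sorted list still contains the duplicate words.

Step 3 - To remove the duplicate words and keep only the unique ones, use dict.fromkeys(words_in_file).

Step 4 - Convert the result back to a list.

Step 5 - Print the list that contains the unique words.

Step 6 - Compute the length of the final list and display its value. This gives the number of unique words in the given text file.

Example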

# Open the text file in read mode; the with-statement closes it automatically
with open("file1.txt", 'r') as file:
    # Convert its content to lowercase
    thegiventxtfile = file.read().lower()

# Split the content into a list of words
words_in_file = thegiventxtfile.split()

print("The given txt file content is :\n")
print(thegiventxtfile)
print("\nThe words given in the txt file are :\n")
print(words_in_file)
print("\nThe sorted words list from this txt file is :\n")

# Sort the list of words in place
words_in_file.sort()

print(words_in_file)
print("\nThe sorted words list after removing duplicates from this txt file is :\n")

# Remove the duplicate words: dictionary keys are unique
myuniquewordlist = list(dict.fromkeys(words_in_file))

# Count the number of words left in the deduplicated list
numberofuniquewords = len(myuniquewordlist)

print(myuniquewordlist)
print("\nThe number of unique words given in this txt file are :\n")
print(numberofuniquewords)

Output

The given txt file content is :

this is a new file.
this is made for testing purposes only.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
oh! no.. there are seven lines now.


The words given in the txt file are :

['this', 'is', 'a', 'new', 'file.', 'this', 'is', 'made', 'for', 'testing', 'purposes', 'only.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'oh!', 'no..', 'there', 'are', 'seven', 'lines', 'now.']

The sorted words list from this txt file is :

['a', 'are', 'are', 'are', 'are', 'are', 'file.', 'file.', 'file.', 'file.', 'file.', 'for', 'four', 'four', 'four', 'four', 'in', 'in', 'in', 'in', 'is', 'is', 'lines', 'lines', 'lines', 'lines', 'lines', 'made', 'new', 'no..', 'now.', 'oh!', 'only.', 'purposes', 'seven', 'testing', 'there', 'there', 'there', 'there', 'there', 'this', 'this', 'this', 'this', 'this', 'this']

The sorted words list after removing duplicates from this txt file is :

['a', 'are', 'file.', 'for', 'four', 'in', 'is', 'lines', 'made', 'new', 'no..', 'now.', 'oh!', 'only.', 'purposes', 'seven', 'testing', 'there', 'this']

The number of unique words given in this txt file are :

19

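Conclusion

This article showed two different ways to find the unique words in a given txt file. First, the txt file is uploaded to a Colab notebook. The file is then opened for reading, and its content is split into words that are stored in a list. Both examples use this word list.

Example 1 uses a Python set. The list may contain duplicate words; converting it to a set keeps only the unique words, and the len() function gives their number. In Example 2, the word list obtained from the txt file is first sorted, which places duplicate words next to each other. dict.fromkeys(words_in_file) is then applied to the sorted list to remove the duplicates, and the length of the resulting list gives the number of unique words.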
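The two approaches can be compared side by side. The snippet below is a minimal sketch on a short hypothetical word list. It skips the sorting step of Example 2 to show one practical difference: a set has no guaranteed order, while dict.fromkeys keeps the words in order of first appearance (dictionaries preserve insertion order in Python 3.7 and later).

```python
# Hypothetical word list with two repeated words.
words = ['this', 'is', 'a', 'new', 'file.', 'this', 'is', 'made']

# Approach of Example 1: a set keeps only unique items, in no particular order.
unique_as_set = set(words)

# Approach of Example 2 without sorting: dictionary keys are unique and
# keep the order in which each word first appeared.
unique_in_order = list(dict.fromkeys(words))

print(len(unique_as_set))
print(unique_in_order)

# Both approaches agree on the number of unique words.
print(len(unique_as_set) == len(unique_in_order))
```

If only the count is needed, the set is the shorter route; if the unique words should also be listed in a predictable order, sorting or dict.fromkeys is the more convenient choice.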

Saba Hilal

Updated on: July 10, 2023
