How to do Fuzzy Matching on a Pandas DataFrame Column Using Python?


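We will match each entry in a column of the first DataFrame against the entries in a column of the second DataFrame. To keep only the closest matches, we use a threshold. We set the threshold to 70, i.e., a match is accepted only when the similarity score between the two strings is 70 or higher on a scale of 0 to 100.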
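
To make the threshold concrete, the following is a minimal sketch of a 0 to 100 similarity score built on the standard library's difflib.SequenceMatcher. It is a stand-in for illustration, not fuzzywuzzy itself: the helper name similarity is hypothetical, lowercasing both strings is an assumption made here, and the scores fuzzywuzzy reports may differ slightly depending on the backend installed.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score for two strings, ignoring case.

    Hypothetical helper for illustration; not part of fuzzywuzzy.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return round(ratio * 100)


threshold = 70
for left, right in [("BMW", "BM"), ("Mercedes", "MERCEDES"), ("Lexus", "Le")]:
    score = similarity(left, right)
    # A pair counts as a match only when the score reaches the threshold
    print(left, right, score, score >= threshold)
```

In this sketch "BMW" and "BM" score 80 and pass the threshold, while "Lexus" and "Le" score 57 and do not.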

Let us first create the dictionaries and convert them to pandas DataFrames:

import pandas as pd

# dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}

d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# convert dictionaries to pandas dataframes
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

Now, convert the DataFrame columns to lists of elements so they can be used for fuzzy matching:

myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()

Example

Following is the complete code:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}

d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# convert dictionaries to pandas dataframes
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

# printing the pandas dataframes
print("Dataframe 1 = \n", df1)
print("Dataframe 2 = \n", df2)

# empty lists for storing the matches later
match1 = []
match2 = []
k = []

# converting dataframe column to list of elements for fuzzy matching
myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()

threshold = 70

# iterating myList1 to extract closest match from myList2
for i in myList1:
    match1.append(process.extractOne(i, myList2, scorer=fuzz.ratio))
df1['matches'] = match1

# keeping only the matches whose score reaches the threshold
for j in df1['matches']:
    if j[1] >= threshold:
        k.append(j[0])
    match2.append(",".join(k))
    k = []

# saving matches to df1
df1['matches'] = match2
print("\nMatches...")
print(df1)

Output

This will produce the following output:

Dataframe 1 =
         Car
0       BMW
1      Audi
2     Lexus
3  Mercedes
4     Rolls
Dataframe 2 =
            Car
0           BM
1         Audi
2           Le
3     MERCEDES
4  Rolls Royce

Matches...
        Car   matches
0       BMW        BM
1      Audi      Audi
2     Lexus
3  Mercedes  MERCEDES
4     Rolls
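
The thresholded lookup can also be sketched with only the standard library, using difflib.get_close_matches in place of process.extractOne. This is a swapped-in technique for illustration, not how fuzzywuzzy is implemented: the helper name closest_match is hypothetical, and lowercasing both sides is an assumption made here so that "Mercedes" and "MERCEDES" compare as equal.

```python
from difflib import get_close_matches


def closest_match(query, choices, threshold=70):
    """Return the best match for query in choices, or "" if none reaches threshold.

    Hypothetical helper; the comparison is case-insensitive by assumption.
    """
    # Map lowercased choices back to their original spelling
    lowered = {choice.lower(): choice for choice in choices}
    hits = get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=threshold / 100
    )
    return lowered[hits[0]] if hits else ""


myList1 = ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]
myList2 = ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]

# One result per entry of myList1; "" marks entries with no close match
matches = [closest_match(car, myList2) for car in myList1]
print(matches)  # ['BM', 'Audi', '', 'MERCEDES', '']
```

The resulting list can be assigned to a DataFrame column in the same way as match2 in the example above.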

Updated on: 09-Sep-2021
