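How to perform fuzzy matching on Pandas DataFrame columns using Python?
We will match the rows of the first DataFrame against the words in the second DataFrame. To accept only the closest matches, we use a threshold. We set the threshold to 70, so a match is recorded only when the similarity score between two strings is at least 70 out of 100.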
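To make the threshold concrete, the sketch below scores a few of the string pairs used in this article. It assumes Python's standard-library difflib.SequenceMatcher as a stand-in for fuzz.ratio and lowercases both strings before comparing. The helper name similarity is hypothetical, and the scores should be close to, though not guaranteed identical to, those reported by fuzzywuzzy.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score.

    Hypothetical helper: uses difflib as a stand-in for fuzz.ratio.
    """
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

threshold = 70  # minimum score required to accept a match

# Score a few pairs and report whether each one clears the threshold
for left, right in [("BMW", "BM"), ("Lexus", "Le"), ("Mercedes", "MERCEDES")]:
    score = similarity(left, right)
    print(left, right, score, "match" if score >= threshold else "no match")
```

Under these assumptions, a short abbreviation such as "Le" shares too few characters with "Lexus" to reach the threshold, while a difference in letter case alone does not lower the score.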
Let us first create the dictionaries and convert them to pandas DataFrames:
import pandas as pd
# dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}
# convert dictionaries to pandas dataframes
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
Now, convert the DataFrame columns to lists of elements for fuzzy matching:
myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()
Example
The complete code is as follows:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# convert dictionaries to pandas dataframes
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

# printing the pandas dataframes
print("Dataframe 1 =\n", df1)
print("Dataframe 2 =\n", df2)

# empty lists for storing the matches later
match1 = []
match2 = []
k = []

# converting dataframe column to list of elements for fuzzy matching
myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()

# minimum similarity score (0-100) required to accept a match
threshold = 70

# iterating myList1 to extract closest match from myList2;
# extractOne returns a (match, score) tuple
for i in myList1:
    match1.append(process.extractOne(i, myList2, scorer=fuzz.ratio))
df1['matches'] = match1

# keep only the matches whose score meets the threshold
for j in df1['matches']:
    if j[1] >= threshold:
        k.append(j[0])
    match2.append(",".join(k))
    k = []

# saving matches to df1
df1['matches'] = match2
print("\nMatches...")
print(df1)
Output
This will produce the following output:
Dataframe 1 =
         Car
0       BMW
1      Audi
2     Lexus
3  Mercedes
4     Rolls
Dataframe 2 =
            Car
0           BM
1         Audi
2           Le
3     MERCEDES
4  Rolls Royce

Matches...
        Car   matches
0       BMW        BM
1      Audi      Audi
2     Lexus
3  Mercedes  MERCEDES
4     Rolls
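The empty cells for Lexus and Rolls arise because their best candidates ("Le" and "Rolls Royce") score below the threshold. The selection step can be sketched without any third-party packages as follows. This is a minimal illustration, assuming difflib.SequenceMatcher as a stand-in for fuzz.ratio and case-insensitive comparison; the helpers similarity and best_match are hypothetical names, not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 score (hypothetical helper; difflib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def best_match(query, choices, threshold=70):
    """Return the highest-scoring choice, or "" if its score is below threshold."""
    scored = [(choice, similarity(query, choice)) for choice in choices]
    choice, score = max(scored, key=lambda pair: pair[1])
    return choice if score >= threshold else ""

cars1 = ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]
cars2 = ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]

# One result per entry in cars1; "" marks entries with no match above the threshold
matches = [best_match(car, cars2) for car in cars1]
print(matches)
```

Lowering the threshold argument would let weaker candidates through, so the value is a trade-off between missed matches and false positives.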