Filtering a PySpark DataFrame by exclusion with isin

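Python is an object-oriented, high-level, interpreted programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it especially attractive for rapid application development and for use as a scripting or glue language that connects existing components.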

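PySpark DataFrame

Data in a PySpark DataFrame is organized into named columns. A DataFrame is a distributed collection of data that can be processed across multiple machines. It can be created from an existing Resilient Distributed Dataset (RDD), an external database, or a structured data file.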

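Syntax - isin()

isin(list_)

The list_ argument is the list of values that the column's values are compared against.

This is used to check or filter whether DataFrame values appear in a list of values, using PySpark's isin() or the SQL IN operator.

The isin() function of the Column class returns a boolean expression. It is True when the value of the column expression is contained in the evaluated values of the arguments. To filter by exclusion, negate the result with the ~ operator.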

Example 1

# Import the necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Create a PySpark DataFrame
data = [("Karthik", 25), ("Vijay", 30), ("Kruthik", 35), ("Sricharan", 40), ("Aditi", 45)]
df = spark.createDataFrame(data, ["name", "age"])

# Define the list of values to exclude
exclusion_list = [25, 30]

# Filter out rows where age is in the exclusion list
filtered_df = df.filter(~col("age").isin(exclusion_list))

# Display the filtered DataFrame
filtered_df.show()

Output

+---------+---+
|     name|age|
+---------+---+
|  Kruthik| 35|
|Sricharan| 40|
|    Aditi| 45|
+---------+---+

Example 2

# Import the necessary libraries
from pyspark.sql.functions import col 
# Create a PySpark DataFrame
data = [("Karthik", "New York"), ("Vijay", "Chicago"), ("Kruthik", "San Francisco"), ("Sricharan", "Los Angeles"), ("Aditi", "Miami")]
df = spark.createDataFrame(data, ["name", "city"])
# Define the list of values to exclude
exclusion_list = ["New York", "Chicago", "Miami"] 
# Filter out rows where city is in the exclusion list
filtered_df = df.filter(~col("city").isin(exclusion_list))
# Display the filtered DataFrame
filtered_df.show()

Output

+---------+-------------+
|     name|         city|
+---------+-------------+
|  Kruthik|San Francisco|
|Sricharan|  Los Angeles|
+---------+-------------+

Creating a demonstration DataFrame

Example 1

# Import the column helper used in the filter below
from pyspark.sql.functions import col
# Create a PySpark DataFrame
data = [("Alice", "New York"), ("Bob", "Chicago"), ("Charlie", "San Francisco"),
        ("David", "Los Angeles"), ("Eva", "Miami")]
df = spark.createDataFrame(data, ["name", "city"])
# Define the list of values to exclude
exclusion_list = ["New York", "Chicago", "Miami"]
# Filter out rows where city is in the exclusion list and name is not "David"
filtered_df = df.filter(~(col("city").isin(exclusion_list) & (col("name") != "David")))
# Display the filtered DataFrame
filtered_df.show()

Output

+-------+-------------+
|   name|         city|
+-------+-------------+
|Charlie|San Francisco|
|  David|  Los Angeles|
+-------+-------------+

Example 2

# Import SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession
# Create a SparkSession and give the application a name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# Sample data: ID, first name, and last name
data = [[1, "Karthik", "Sharma"],
        [2, "Kruthik", "Ballari"],
        [3, "Vijay", "Kumar"],
        [4, "Aditi", "Gupta"],
        [5, "Sricharan", "Sharma"]]
# Specify the column names
columns = ['ID', 'NAME', 'Lastname']
# Create a DataFrame from the list of rows
dataframe = spark.createDataFrame(data, columns)
dataframe.show()

Output

+---+---------+--------+
| ID|     NAME|Lastname|
+---+---------+--------+
|  1|  Karthik|  Sharma|
|  2|  Kruthik| Ballari|
|  3|    Vijay|   Kumar|
|  4|    Aditi|   Gupta|
|  5|Sricharan|  Sharma|
+---+---------+--------+

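Example 3

The following code looks up a name in a DataFrame column and displays the matching row.

filter(): this clause checks a condition and returns the rows that satisfy it. It is equivalent to where().

Syntax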

dataframe.filter(condition)

# Getting Kruthik's name
dataframe.filter((dataframe.NAME).isin(['Kruthik'])).show()

Output

+---+-------+--------+
| ID|   NAME|Lastname|
+---+-------+--------+
|  2|Kruthik| Ballari|
+---+-------+--------+

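Example 4

The following program fetches the rows whose last name is Sharma from the DataFrame and prints them, including the full name.

where(): this clause checks a condition and returns the rows that satisfy it.

Syntax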

dataframe.where(condition)

# Fetching the rows of people whose last name is Sharma
dataframe.where((dataframe.Lastname).isin(['Sharma'])).show()

Output

+---+---------+--------+
| ID|     NAME|Lastname|
+---+---------+--------+
|  1|  Karthik|  Sharma|
|  5|Sricharan|  Sharma|
+---+---------+--------+

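Conclusion

Filtering by exclusion with isin is a useful way to remove the rows of a PySpark DataFrame whose column value matches one of a predefined list of values. It can be combined with other conditions and applied in many different situations.

This approach is particularly helpful with very large datasets, because it can substantially reduce the amount of data that needs to be processed. Data scientists and analysts can use it to quickly filter out irrelevant data and focus on the specific information their work requires.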

Jaisshree

Updated on: 10 August 2023
