Cleaning Data with dropna in PySpark

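Data cleaning is a crucial step in any data analysis or data science work, because it ensures that the data is accurate, reliable, and suitable for the intended analysis. Data cleaning functions such as dropna make PySpark a powerful tool for processing large datasets.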

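The dropna function in PySpark lets you remove rows that contain missing or null values from a DataFrame. Missing or null values can appear in a DataFrame for many reasons, such as incomplete data, data entry errors, or inconsistent data formats. Removing these rows helps ensure data quality for the analysis that follows.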

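dropna is a flexible function that lets you specify the conditions under which rows are dropped. You can choose whether a row is dropped when any of its values is null or only when all of them are null (how), set the minimum number of non-null values a row needs in order to be kept (thresh), and restrict the null check to a subset of columns (subset). Unlike the pandas function of the same name, PySpark's DataFrame.dropna has no axis argument: it removes rows, never columns.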

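Using dropna in PySpark can noticeably improve data quality and reliability. By removing rows that contain missing or null values, you make sure your analysis is based on complete records. Because it is flexible and easy to use, dropna belongs in every PySpark user's data cleaning toolkit.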

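In this article, we discuss the process of cleaning a DataFrame and how to use the dropna() function to do it. The main goal of cleaning a DataFrame is to make sure it contains accurate, reliable data that is suitable for analysis.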

The syntax of the dropna() function is as follows:

df.dropna(how="any", thresh=None, subset=None)

Here df is the DataFrame to clean. The function accepts three arguments:

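how − Specifies whether a row is dropped when any or all of its values are null. With "any" (the default), a row is dropped if it contains at least one null value. With "all", a row is dropped only if every value in it is null.
thresh − Specifies the minimum number of non-null values a row must contain to be kept. A row with fewer than thresh non-null values is dropped. When thresh is set, it overrides the how argument.
subset − Specifies the columns to consider when checking for null values, given as a column name or a list of column names. With the default how="any", a row is dropped if any of the listed columns is null; values in other columns are ignored.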
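The way the three arguments interact can be sketched in plain Python. This is only a model of the rule described above, not PySpark's implementation: the helper name keeps_row is hypothetical, rows are plain tuples rather than DataFrame rows, and NaN handling is ignored.

```python
# Plain-Python model of the dropna() row rule; not PySpark's implementation.
# Rows are tuples, and only None counts as a missing value here.

def keeps_row(row, columns, how="any", thresh=None, subset=None):
    """Return True if a dropna-style call would keep this row."""
    if subset is None:
        checked = list(columns)
    elif isinstance(subset, str):
        checked = [subset]
    else:
        checked = list(subset)

    # Without thresh, "any" requires every checked value to be non-null,
    # while "all" requires at least one non-null value.
    if thresh is None:
        thresh = len(checked) if how == "any" else 1

    non_null = sum(1 for name in checked if row[columns.index(name)] is not None)
    return non_null >= thresh


columns = ["Id", "Name", "Job Profile", "City"]
row = (2, None, "Software Developer", None)

print(keeps_row(row, columns, how="any"))             # False: two values are null
print(keeps_row(row, columns, how="all"))             # True: not every value is null
print(keeps_row(row, columns, how="any", thresh=2))   # True: thresh overrides how
print(keeps_row(row, columns, subset="Job Profile"))  # True: only that column is checked
```

Expressing how as a default value for thresh makes it easy to see why an explicit thresh takes precedence: once thresh is given, how no longer influences the decision.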

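By calling dropna() with the appropriate arguments, you can clean a DataFrame and remove rows with null or missing values. This matters because null or missing values can make an analysis inaccurate, and removing them improves the accuracy and reliability of the data. dropna() also works on both small and large datasets, which makes it useful in any PySpark data cleaning project.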

Before we can use dropna to remove null values, we first have to create a PySpark DataFrame. Once the DataFrame exists, we can apply dropna to remove the rows that contain null values.

To run the code in this tutorial, the pyspark module must be installed.

The following command installs the pyspark module.

Command

pip3 install pyspark

Consider the code shown below.

Example

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
	spk = SparkSession.builder \
		.master("local") \
		.appName("Employee_detail.com") \
		.getOrCreate()
	return spk

# function to create DataFrame from data and schema
def create_df(spark, data, schema):
	df1 = spark.createDataFrame(data, schema)
	return df1


if __name__ == "__main__":

	# calling function to create SparkSession
	spark = create_session()

	# creating sample data with different data types
	input_data = [(1, "John", "Data Scientist", "Seattle"),
				(2, None, "Software Developer", None),
				(3, "Emma", "Data Analyst", "New York"),
				(4, None, None, "San Francisco"),
				(5, "Andrew", "Android Developer", "Los Angeles"),
				(6, "Sarah", None, None),
				(None, None, None, None)]
	
	# creating schema for DataFrame
	schema = ["Id", "Name", "Job Profile", "City"]

	# calling function to create dataframe
	df = create_df(spark, input_data, schema)

	# displaying the created DataFrame
	df.show()

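Explanation

This code shows how to create a new SparkSession and a DataFrame in PySpark. To do so, it first imports SparkSession from pyspark.sql.

Next, it defines a function named create_session() that sets up and configures a SparkSession. The function tells Spark to run locally on a single machine, sets the application name, and either creates a new SparkSession or returns an existing one.

It then defines the create_df() function, which builds a new DataFrame from the input data and schema. The function takes the SparkSession, the input data, and the schema as arguments and returns the new DataFrame.

The input data is a list of tuples, where each tuple represents one row of the DataFrame. The schema is a list of column names, where each name corresponds to one column of the DataFrame.

Finally, the main part of the code calls create_session() to obtain a SparkSession, defines the input data and schema, and calls create_df() to build the DataFrame from them. The resulting DataFrame is then printed with the .show() method.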

To run the code above, execute the command shown below.

Command

python3 main.py

After running the command, we can expect output like the following.

Output

+----+------+------------------+-------------+
|  Id|  Name|       Job Profile|         City|
+----+------+------------------+-------------+
|   1|  John|    Data Scientist|      Seattle|
|   2|  null|Software Developer|         null|
|   3|  Emma|      Data Analyst|     New York|
|   4|  null|              null|San Francisco|
|   5|Andrew| Android Developer|  Los Angeles|
|   6| Sarah|              null|         null|
|null|  null|              null|         null|
+----+------+------------------+-------------+

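Cleaning data with the any argument in PySpark

In the code below, dropna() is called with how="any". This argument specifies that every row in the DataFrame that contains at least one null value is dropped.

Consider the code shown below.

Example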

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
	spk = SparkSession.builder \
		.master("local") \
		.appName("Employee_detail.com") \
		.getOrCreate()
	return spk

# function to create DataFrame from data and schema
def create_df(spark, data, schema):
	df1 = spark.createDataFrame(data, schema)
	return df1


if __name__ == "__main__":

	# calling function to create SparkSession
	spark = create_session()

	# creating sample data with different data types
	input_data = [(1, "John", "Data Scientist", "Seattle"),
				(2, None, "Software Developer", None),
				(3, "Emma", "Data Analyst", "New York"),
				(4, None, None, "San Francisco"),
				(5, "Andrew", "Android Developer", "Los Angeles"),
				(6, "Sarah", None, None),
				(None, None, None, None)]
	
	# creating schema for DataFrame
	schema = ["Id", "Name", "Job Profile", "City"]

	# calling function to create dataframe
	df = create_df(spark, input_data, schema)

	# displaying the created DataFrame
	# df.show()

	# drop every row that contains at least one null value
	df = df.dropna(how="any")
	df.show()

To run the code above, execute the command shown below.

Command

python3 main.py

After running the command, we can expect output like the following.

Output

+---+------+-----------------+-----------+
| Id|  Name|      Job Profile|       City|
+---+------+-----------------+-----------+
|  1|  John|   Data Scientist|    Seattle|
|  3|  Emma|     Data Analyst|   New York|
|  5|Andrew|Android Developer|Los Angeles|
+---+------+-----------------+-----------+

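Cleaning data with the all argument in PySpark

In the code below, dropna() is called with how="all". This argument specifies that a row is dropped only if every value in it is null.

Consider the code shown below.

Example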

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
	spk = SparkSession.builder \
		.master("local") \
		.appName("Employee_detail.com") \
		.getOrCreate()
	return spk

# function to create DataFrame from data and schema
def create_df(spark, data, schema):
	df1 = spark.createDataFrame(data, schema)
	return df1


if __name__ == "__main__":

	# calling function to create SparkSession
	spark = create_session()

	# creating sample data with different data types
	input_data = [(1, "John", "Data Scientist", "Seattle"),
				(2, None, "Software Developer", None),
				(3, "Emma", "Data Analyst", "New York"),
				(4, None, None, "San Francisco"),
				(5, "Andrew", "Android Developer", "Los Angeles"),
				(6, "Sarah", None, None),
				(None, None, None, None)]
	
	# creating schema for DataFrame
	schema = ["Id", "Name", "Job Profile", "City"]

	# calling function to create dataframe
	df = create_df(spark, input_data, schema)

	# displaying the created DataFrame
	# df.show()

	# drop only the rows in which every value is null
	df = df.dropna(how="all")
	df.show()

To run the code above, execute the command shown below.

Command

python3 main.py

After running the command, we can expect output like the following.

Output

+---+------+------------------+-------------+
| Id|  Name|       Job Profile|         City|
+---+------+------------------+-------------+
|  1|  John|    Data Scientist|      Seattle|
|  2|  null|Software Developer|         null|
|  3|  Emma|      Data Analyst|     New York|
|  4|  null|              null|San Francisco|
|  5|Andrew| Android Developer|  Los Angeles|
|  6| Sarah|              null|         null|
+---+------+------------------+-------------+

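Cleaning data with the thresh argument in PySpark

In the code below, dropna() is called with thresh=2. This argument specifies that every row in the DataFrame with fewer than two non-null values is dropped.

Consider the code shown below.

Example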

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
	spk = SparkSession.builder \
		.master("local") \
		.appName("Employee_detail.com") \
		.getOrCreate()
	return spk

# function to create DataFrame from data and schema
def create_df(spark, data, schema):
	df1 = spark.createDataFrame(data, schema)
	return df1


if __name__ == "__main__":

	# calling function to create SparkSession
	spark = create_session()

	# creating sample data with different data types
	input_data = [(1, "John", "Data Scientist", "Seattle"),
				(2, None, "Software Developer", None),
				(3, "Emma", "Data Analyst", "New York"),
				(4, None, None, "San Francisco"),
				(5, "Andrew", "Android Developer", "Los Angeles"),
				(6, "Sarah", None, None),
				(None, None, None, None)]
	
	# creating schema for DataFrame
	schema = ["Id", "Name", "Job Profile", "City"]

	# calling function to create dataframe
	df = create_df(spark, input_data, schema)

	# displaying the created DataFrame
	# df.show()

	# drop every row that has fewer than two non-null values
	df = df.dropna(thresh=2)
	df.show()

To run the code above, execute the command shown below.

Command

python3 main.py

After running the command, we can expect output like the following.

Output

+---+------+------------------+-------------+
| Id|  Name|       Job Profile|         City|
+---+------+------------------+-------------+
|  1|  John|    Data Scientist|      Seattle|
|  2|  null|Software Developer|         null|
|  3|  Emma|      Data Analyst|     New York|
|  4|  null|              null|San Francisco|
|  5|Andrew| Android Developer|  Los Angeles|
|  6| Sarah|              null|         null|
+---+------+------------------+-------------+
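For this sample data, thresh=2 keeps exactly the same rows as how="all". Counting the non-null values in each row shows why. The snippet below is a plain-Python check on the same tuples, not a PySpark call, and the helper name kept_ids is hypothetical.

```python
# Plain-Python check on the sample rows; no Spark session is needed.
input_data = [(1, "John", "Data Scientist", "Seattle"),
              (2, None, "Software Developer", None),
              (3, "Emma", "Data Analyst", "New York"),
              (4, None, None, "San Francisco"),
              (5, "Andrew", "Android Developer", "Los Angeles"),
              (6, "Sarah", None, None),
              (None, None, None, None)]

# Number of non-null values in each row, in input order.
non_null_counts = [sum(value is not None for value in row) for row in input_data]
print(non_null_counts)  # [4, 2, 4, 2, 4, 2, 0]


def kept_ids(thresh):
    """Ids of the rows that have at least `thresh` non-null values."""
    return [row[0] for row, count in zip(input_data, non_null_counts)
            if count >= thresh]


print(kept_ids(2))  # [1, 2, 3, 4, 5, 6]
print(kept_ids(3))  # [1, 3, 5]
```

Every row that has at least one non-null value also has at least two, so only the fully null row falls below the threshold. Raising the threshold to 3 would leave only the three complete rows.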

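Cleaning data with the subset argument in PySpark

In the code below, we pass subset="City" to dropna(), where City is the name of a column. If a row has a null value in that column, the row is dropped from the DataFrame.

Consider the code shown below.

Example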

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
	spk = SparkSession.builder \
		.master("local") \
		.appName("Employee_detail.com") \
		.getOrCreate()
	return spk

# function to create DataFrame from data and schema
def create_df(spark, data, schema):
	df1 = spark.createDataFrame(data, schema)
	return df1


if __name__ == "__main__":

	# calling function to create SparkSession
	spark = create_session()

	# creating sample data with different data types
	input_data = [(1, "John", "Data Scientist", "Seattle"),
				(2, None, "Software Developer", None),
				(3, "Emma", "Data Analyst", "New York"),
				(4, None, None, "San Francisco"),
				(5, "Andrew", "Android Developer", "Los Angeles"),
				(6, "Sarah", None, None),
				(None, None, None, None)]
	
	# creating schema for DataFrame
	schema = ["Id", "Name", "Job Profile", "City"]

	# calling function to create dataframe
	df = create_df(spark, input_data, schema)

	# displaying the created DataFrame
	# df.show()

	# drop every row whose City value is null
	df = df.dropna(subset="City")
	df.show()

To run the code above, execute the command shown below.

Command

python3 main.py

After running the command, we can expect output like the following.

Output

+---+------+-----------------+-------------+
| Id|  Name|      Job Profile|         City|
+---+------+-----------------+-------------+
|  1|  John|   Data Scientist|      Seattle|
|  3|  Emma|     Data Analyst|     New York|
|  4|  null|             null|San Francisco|
|  5|Andrew|Android Developer|  Los Angeles|
+---+------+-----------------+-------------+
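subset also accepts a list of column names, and it can be combined with how, for example df.dropna(how="all", subset=["Name", "City"]). The plain-Python sketch below models which rows of the sample data such calls would keep. It is an illustration of the rule rather than PySpark code, and the helper name kept_ids is hypothetical.

```python
# Plain-Python model of subset combined with how; not PySpark's implementation.
columns = ["Id", "Name", "Job Profile", "City"]
input_data = [(1, "John", "Data Scientist", "Seattle"),
              (2, None, "Software Developer", None),
              (3, "Emma", "Data Analyst", "New York"),
              (4, None, None, "San Francisco"),
              (5, "Andrew", "Android Developer", "Los Angeles"),
              (6, "Sarah", None, None),
              (None, None, None, None)]


def kept_ids(subset, how):
    """Ids of the rows kept when only the `subset` columns are checked."""
    positions = [columns.index(name) for name in subset]
    # how="any" drops on any null, so a kept row needs all values present;
    # how="all" drops only when all are null, so one present value is enough.
    test = all if how == "any" else any
    return [row[0] for row in input_data
            if test(row[i] is not None for i in positions)]


print(kept_ids(["City"], "any"))          # [1, 3, 4, 5]
print(kept_ids(["Name", "City"], "any"))  # [1, 3, 5]
print(kept_ids(["Name", "City"], "all"))  # [1, 3, 4, 5, 6]
```

The first result matches the output above. With two columns, how="any" requires both Name and City to be present, while how="all" drops a row only when both are missing.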

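Conclusion

In summary, data cleaning is an essential part of data preprocessing before any analysis or modeling. In Python, both the Pandas library and the PySpark DataFrame API provide a dropna() function, which offers a simple and effective way to remove rows that contain null values from a DataFrame.

By setting arguments such as how, thresh, and subset, users can control the function's behavior and tailor the cleaning process. Overall, dropna() is a powerful data cleaning tool that helps improve data quality and the accuracy of any analysis or modeling that follows.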

Mukul Latiyan

Updated on: August 3, 2023
