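How to create an empty PySpark DataFrame?

PySpark is the Python API for Apache Spark and is widely used for large-scale data processing. It lets you write Spark jobs in Python and run them across a cluster.

A PySpark DataFrame is a distributed collection of data organized into named columns. It is similar to a table in a relational database, where columns represent features and rows represent observations. DataFrames can be created from a variety of data sources, such as CSV, JSON, and Parquet files, as well as existing RDDs (Resilient Distributed Datasets). Sometimes, however, you need an empty DataFrame, for example to fix a schema up front or to act as a placeholder for data that will arrive later. This tutorial walks through two examples.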

Syntax

To create an empty PySpark DataFrame, use the following syntax:

empty_df = spark.createDataFrame([], schema)

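Here we pass an empty list of rows and a schema to the 'createDataFrame()' method, which returns an empty DataFrame. The schema is required because Spark has no rows from which to infer the column names and types.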

Example

In this example, we create an empty DataFrame with a single column.

# Import the necessary modules.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# Create a SparkSession object.
spark = SparkSession.builder.appName("EmptyDataFrame").getOrCreate()

# Define the schema of the DataFrame.
schema = StructType([StructField("age", IntegerType(), True)])

# Create an empty DataFrame.
empty_df = spark.createDataFrame([], schema)

# Print the output.
empty_df.show()

In this example, we first define a schema with a single IntegerType column named "age", then use that schema to create an empty DataFrame. Finally, we display the empty DataFrame with the 'show()' method.

Output

+---+
|age|
+---+
+---+

Example

In this example, we create an empty DataFrame with multiple columns.

# Import the necessary modules.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a SparkSession object.
spark = SparkSession.builder.appName("EmptyDataFrame").getOrCreate()

# Define the schema of the DataFrame.
schema = StructType([
    StructField("col_1", StringType(), True),
    StructField("col_2", StringType(), True),
    StructField("col_3", StringType(), True),
    StructField("col_4", StringType(), True),
    StructField("col_5", StringType(), True),
    StructField("col_6", StringType(), True),
    StructField("col_7", StringType(), True),
    StructField("col_8", StringType(), True),
    StructField("col_9", StringType(), True),
    StructField("col_10", IntegerType(), True)
])

# Create an empty DataFrame.
empty_df = spark.createDataFrame([], schema)

# Print the output, allowing up to 10,000 rows to be displayed.
empty_df.show(10000)

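In this example, we first define a schema with ten columns: nine 'StringType' columns named "col_1" through "col_9" and one 'IntegerType' column named "col_10". We then use that schema to create an empty DataFrame. Finally, we call 'show(10000)', which prints at most 10,000 rows, to confirm that the DataFrame really is empty.

Even though 'show()' was allowed to print up to 10,000 rows, the output contains only the column headers, because the DataFrame has no rows at all.

Output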

+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+

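In this tutorial, we learned how to create an empty PySpark DataFrame with the 'createDataFrame()' method. We walked through two examples: an empty DataFrame with a single column and an empty DataFrame with multiple columns. To create an empty DataFrame, we first define a schema using 'StructType()' and 'StructField()', then pass it together with an empty list '[]' to 'createDataFrame()'. The result is an empty DataFrame with the specified schema. Creating an empty DataFrame lets us fix its structure in advance and combine it with data later as needed. This is useful when the structure of a dataset is known ahead of time but the data itself is not yet available.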

Manthan Ghasadiya

Updated on: April 10, 2023
