Python Pandas - Working with the HDF5 Format

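When working with large datasets, you may run into "out of memory" errors. Optimized storage formats such as HDF5 help avoid them. The pandas library provides tools such as the HDFStore class and read/write APIs that make it easy to store, retrieve, and manipulate data while keeping memory use low and retrieval fast.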

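HDF5 stands for Hierarchical Data Format version 5, an open-source file format designed to store large, complex, and heterogeneous data efficiently. It organizes data in a hierarchy similar to a file system, where groups act as directories and datasets act as files. Because a single HDF5 file can hold different kinds of data (such as arrays, images, tables, and documents) in that hierarchy, it is well suited to managing heterogeneous data.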

Working with the HDF5 Format Using HDFStore

The HDFStore class in pandas manages HDF5 files through a dictionary-like interface. It relies on the PyTables library to read and write pandas objects in HDF5 format.

Example: Creating an HDF5 File with HDFStore in Pandas

The following example creates an HDF5 file with the pandas.HDFStore class.

import pandas as pd
import numpy as np

# Create the store using the HDFStore class
store = pd.HDFStore("store.h5")

# Display the store
print(store)

# It is important to close the store after use
store.close()

The output of the above code is as follows:

<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
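
The reminder to close the store can also be handled by a with block, since HDFStore works as a context manager. The following is a minimal sketch rather than part of the original example; the temporary directory and the variable names are assumptions chosen so that the snippet does not touch your working directory.

```python
import os
import tempfile

import pandas as pd

# Assumed location: a temporary directory keeps the sketch self-contained
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "store.h5")

# The with block closes the store on exit, even if an exception is raised inside it
with pd.HDFStore(path) as store:
    opened_inside = store.is_open

# After the block, the underlying file handle has been released
opened_after = store.is_open

print(opened_inside, opened_after)
```

With this pattern there is no separate store.close() call to forget.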

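Note: To use the HDF5 format in pandas, you need the PyTables library. It is an optional dependency of pandas and must be installed separately with one of the following commands: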

# Using pip
pip install tables

# or using conda installer
conda install pytables

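Example: Writing and Reading HDF5 Data with HDFStore in Pandas

Because HDFStore is a dictionary-like object, you can write data to and read data from the HDF5 store directly with key-value access. The following example demonstrates this: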

import pandas as pd
import numpy as np

# Create the store
store = pd.HDFStore("store.h5")

# Create the data 
index = pd.date_range("1/1/2024", periods=8)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

# Write Pandas data to the Store, which is equivalent to store.put('s', s)
store["s"] = s  
store["df"] = df

# Read Data from the store, which is equivalent to store.get('df')
from_store = store["df"]
print('Retrieved Data From the HDFStore:\n',from_store)

# Close the store after use
store.close()

The output of the above code is as follows (the values are random, so yours will differ):

Retrieved Data From the HDFStore:
                    A         B         C
2024-01-01  0.200467  0.341899  0.105715
2024-01-02 -0.379214  1.527714  0.186246
2024-01-03 -0.418122  1.008820  1.331104
2024-01-04  0.146418  0.587433 -0.750389
2024-01-05 -0.556524 -0.551443 -0.161225
2024-01-06 -0.214145 -0.722693  0.072083
2024-01-07  0.631878 -0.521474 -0.769847
2024-01-08 -0.361999  0.435252  1.177110
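
The dictionary-like interface goes beyond assignment and lookup. The sketch below, which is an illustration and not part of the original example, lists keys, tests membership, and deletes an entry. The group name "raw" and the temporary file path are assumptions; the slash-separated key shows how the group hierarchy described earlier maps onto store keys.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Assumed location: a temporary file keeps the sketch self-contained
path = os.path.join(tempfile.mkdtemp(), "store.h5")
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])

with pd.HDFStore(path) as store:
    store["df"] = df
    # A slash-separated key places the object inside a group (assumed name: "raw")
    store["raw/df_copy"] = df

    keys_before = sorted(store.keys())
    has_df = "df" in store

    # Deleting a key is equivalent to store.remove("df")
    del store["df"]
    keys_after = sorted(store.keys())

print(keys_before)
print(has_df)
print(keys_after)
```

Keys are reported with a leading slash, in the same way that absolute paths are reported in a file system.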

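Reading and Writing the HDF5 Format with the Pandas API

Pandas also provides high-level APIs that simplify working with HDF5 files. These APIs read and write HDF5 data directly, without requiring you to create an HDFStore object yourself. The main pandas APIs for HDF5 files are:

pandas.read_hdf(): reads a pandas object from an HDF5 file.
pandas.DataFrame.to_hdf() or pandas.Series.to_hdf(): writes a pandas object to an HDF5 file using HDFStore.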

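Writing Pandas Data to HDF5 with to_hdf()

The to_hdf() method writes pandas objects such as DataFrame and Series directly to an HDF5 file using HDFStore. It accepts optional parameters for compression, missing-value handling, storage format, and more, so you can control how the data is stored.

Example

This example writes data to an HDF5 file with the DataFrame.to_hdf() method.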

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},index=['x', 'y', 'z']) 

# Write data to an HDF5 file using the to_hdf()
df.to_hdf("data_store.h5", key="df", mode="w", format="table")

print("Data successfully written to HDF5 file")

The output of the above code is as follows:

Data successfully written to HDF5 file
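
The compression options mentioned above can be tried with the complevel and complib parameters of to_hdf(). The following is a rough sketch, not a benchmark: the file names, the row count, and the choice of zlib at level 9 are assumptions made for illustration, and the size difference you observe depends on your data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Assumed locations: two temporary files, one per variant
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.h5")
packed_path = os.path.join(tmp_dir, "packed.h5")

# Illustrative data: a counter column and a highly compressible column of zeros
df = pd.DataFrame({"A": np.arange(100_000), "B": np.zeros(100_000)})

# Same data written without and with compression
df.to_hdf(plain_path, key="df", mode="w", format="table")
df.to_hdf(packed_path, key="df", mode="w", format="table", complevel=9, complib="zlib")

# Compression is transparent on read: no extra arguments are needed
restored = pd.read_hdf(packed_path, key="df")
same_values = np.array_equal(restored.to_numpy(), df.to_numpy())

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print(same_values, plain_size, packed_size)
```

Higher compression levels trade write time for smaller files, so a lower complevel may be a better fit when write speed matters more than disk space.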

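Reading Data from HDF5 with read_hdf()

The pandas.read_hdf() function retrieves a pandas object stored in an HDF5 file. It accepts the path of the file to read from (as a string or path object) or an already open HDFStore.

Example

This example uses pd.read_hdf() to read the data stored under the key "df" in the HDF5 file "data_store.h5".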

import pandas as pd

# Read data from the HDF5 file using the read_hdf()
retrieved_df = pd.read_hdf("data_store.h5", key="df")

# Display the retrieved data
print("Retrieved Data:\n", retrieved_df.head())

The output of the above code is as follows:

Retrieved Data:
    A  B
x  1  4
y  2  5
z  3  6
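
When a file is too large to load at once, read_hdf() can return only part of it. The sketch below illustrates two options available for data written with format="table": a where filter with a columns selection, and chunked reading with chunksize. It writes its own temporary file (an assumed path) with data_columns=True so that column A can be queried; the filter expression itself is just an example.

```python
import os
import tempfile

import pandas as pd

# Assumed location: a temporary file keeps the sketch self-contained
path = os.path.join(tempfile.mkdtemp(), "data_store.h5")

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# data_columns=True makes every column usable in a where expression
df.to_hdf(path, key="df", mode="w", format="table", data_columns=True)

# Read only the rows where A > 1, and only column A
subset = pd.read_hdf(path, key="df", where="A > 1", columns=["A"])
print(subset)

# Read the table in pieces of at most two rows each
chunk_sizes = [len(chunk) for chunk in pd.read_hdf(path, key="df", chunksize=2)]
print(chunk_sizes)
```

Both options require the table format; data written with the default fixed format has to be read in full.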

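Appending Data to an HDF5 File with to_hdf()

You can append data to an existing HDF5 file by calling to_hdf() with mode="a" and append=True. mode="a" opens the file without truncating it, and append=True (which requires format="table") adds the new rows to the data already stored under the key. This is useful when you want to add new data to a file without overwriting its existing contents.

Example

This example appends data to the existing HDF5 file "data_store.h5" with to_hdf().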

import pandas as pd
import numpy as np

# Create a DataFrame to append
df_new = pd.DataFrame({'A': [7, 8], 'B': [1, 1]},index=['i', 'j'])

# Append the new data to the existing HDF5 file
df_new.to_hdf("data_store.h5", key="df", mode="a", format="table", append=True)

print("Data successfully appended")

# Now read data from the HDF5 file using the read_hdf()
retrieved_df = pd.read_hdf("data_store.h5", key='df')

# Display the retrieved data
print("Retrieved Data:\n", retrieved_df.head())

The output of the above code is as follows:

Data successfully appended
Retrieved Data:
    A  B
x  1  4
y  2  5
z  3  6
i  7  1
j  8  1
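
The same append call can be repeated to build up a table piece by piece, which is one way to write a dataset that never fits in memory as a whole. The following sketch assumes a hypothetical key "log", a temporary file path, and small generated chunks standing in for real batches. Its last step shows the contrast with mode="w", which replaces the file instead of extending it.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Assumed location: a temporary file keeps the sketch self-contained
path = os.path.join(tempfile.mkdtemp(), "log_store.h5")

# Write three batches of two rows each under the same (hypothetical) key
for start in range(0, 6, 2):
    rows = np.arange(start, start + 2, dtype="int64")
    chunk = pd.DataFrame({"value": rows}, index=rows)
    chunk.to_hdf(path, key="log", mode="a", format="table", append=True)

combined = pd.read_hdf(path, key="log")
print(len(combined))

# mode="w" truncates the file first, so only the last chunk remains afterwards
chunk.to_hdf(path, key="log", mode="w", format="table")
overwritten = pd.read_hdf(path, key="log")
print(len(overwritten))
```

The first call creates the file and the table, so no separate setup step is needed before the loop.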
