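How to select a subset of data using lexicographical slicing in Python Pandas?

Introduction

Pandas offers two ways to select a subset of a dataset: by index position or by index label. In this post, I will show you how to select a subset of data using lexicographical slicing.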
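As a small illustration of the two selection styles, the sketch below uses a hypothetical three-row frame (the titles and numbers are made up for the example and are not from the movies dataset). `.iloc` selects by position and `.loc` selects by label.

```python
import pandas as pd

# Hypothetical mini dataset, only for illustration
df = pd.DataFrame(
    {"budget": [100, 200, 300], "vote_average": [6.1, 7.2, 5.3]},
    index=pd.Index(["Alpha", "Bravo", "Charlie"], name="title"),
)

# Position-based selection: rows 0 and 1 (the end position is excluded)
by_position = df.iloc[0:2]

# Label-based selection: "Alpha" through "Bravo" (the end label is included)
by_label = df.loc["Alpha":"Bravo"]

print(by_position.equals(by_label))
```

Note the difference in the end bound: a position slice stops before the end position, while a label slice includes the end label.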

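Plenty of datasets can be found through Google. Search for a movies dataset on kaggle.com. This post uses a movies dataset from Kaggle.

How to do it

1. Import the movies dataset with only the columns required for this example.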

import pandas as pd
import numpy as np
# Load only the columns needed here and use the movie title as the index
movies = pd.read_csv(
    "https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",
    index_col="title",
    usecols=["title", "budget", "vote_average", "vote_count"],
)
movies.sample(n=5)

title	budget	vote_average	vote_count
Little Voice	0	6.6	61
Grown Ups 2	80000000	5.8	1155
The Best Years of Our Lives	2100000	7.6	143
Tusk	2800000	5.1	366
黃海決戰	0	5.8	29

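2. I always recommend sorting the index, especially if it is made up of strings. With a sorted index, you will notice the difference when working with large datasets.

What if I don't sort the index?

No problem, your code will just run forever. Just kidding. If the index labels are unsorted, pandas may have to go through the labels one by one to match your query. Imagine an Oxford dictionary with no index pages: how would you find anything? With a sorted index, you can jump straight to the label you want to extract, and the same is true for pandas.

Let us first check whether our index is sorted.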

# check whether the index is sorted (is_monotonic was removed in pandas 2.0)
movies.index.is_monotonic_increasing

False
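The same check can be tried on a small hypothetical frame (the three titles below are placeholders chosen for the example). This sketch uses `is_monotonic_increasing` together with its counterpart `is_monotonic_decreasing`; the older `is_monotonic` alias was deprecated in pandas 1.5 and removed in 2.0.

```python
import pandas as pd

# Hypothetical titles in arbitrary order, only for illustration
df = pd.DataFrame(
    {"budget": [3, 1, 2]},
    index=pd.Index(["Cars", "Abduction", "Battleship"], name="title"),
)

# The labels are neither ascending nor descending, so both checks fail
unsorted_increasing = df.index.is_monotonic_increasing
unsorted_decreasing = df.index.is_monotonic_decreasing
print(unsorted_increasing, unsorted_decreasing)
```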

3. Clearly, the index is not sorted. Let us try to select the movies whose titles start with A. This is roughly like writing

select * from movies where title like 'A%'

movies.loc["Aa":"Bb"]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4844             try:
-> 4845                 return self._searchsorted_monotonic(label, side)
   4846             except ValueError:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(self, label, side)
   4805
-> 4806         raise ValueError("index must be monotonic increasing or decreasing")
   4807

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
in
----> 1 movies.loc["Aa": "Bb"]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1766
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769
   1770     def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1910         if isinstance(key, slice):
   1911             self._validate_key(key, axis)
-> 1912             return self._get_slice_axis(key, axis=axis)
   1913         elif com.is_bool_indexer(key):
   1914             return self._getbool_axis(key, axis=axis)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1794
   1795         labels = obj._get_axis(axis)
-> 1796         indexer = labels.slice_indexer(
   1797             slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name
   1798         )

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
   4711         slice(1, 3)
   4712         """
-> 4713         start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
   4714
   4715         # return a slice

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, end, step, kind)
   4924         start_slice = None
   4925         if start is not None:
-> 4926             start_slice = self.get_slice_bound(start, "left", kind)
   4927         if start_slice is None:
   4928             start_slice = 0

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4846             except ValueError:
   4847                 # raise the original KeyError
-> 4848                 raise err
   4849
   4850         if isinstance(slc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4840         # we need to look up the label
   4841         try:
-> 4842             slc = self.get_loc(label)
   4843         except KeyError as err:
   4844             try:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

KeyError: 'Aa'
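The failure can be reproduced without downloading anything. The sketch below assumes a small hypothetical frame with an unsorted index; because neither "Aa" nor "Bb" is an actual label and the index is not monotonic, pandas has no way to locate the slice bounds and raises a `KeyError`.

```python
import pandas as pd

# Hypothetical titles in arbitrary order, only for illustration
df = pd.DataFrame(
    {"budget": [3, 1, 2]},
    index=pd.Index(["Cars", "Abduction", "Battleship"], name="title"),
)

try:
    df.loc["Aa":"Bb"]  # bounds are not labels and the index is unsorted
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```

Catching the exception here is only to keep the example runnable; in practice the fix is to sort the index first.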

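4. Sort the index in ascending order, then try the same command to take advantage of the sorting for lexicographical slicing.

movies.sort_index(inplace=True)
movies.index.is_monotonic_increasing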

True
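A related detail, sketched on a hypothetical frame with a repeated title: without `inplace=True`, `sort_index` returns a sorted copy and leaves the original untouched, and duplicate labels do not stop the sorted index from counting as monotonic.

```python
import pandas as pd

# Hypothetical frame with a repeated title, only for illustration
df = pd.DataFrame(
    {"budget": [3, 1, 2, 4]},
    index=pd.Index(["Cars", "Abduction", "Battleship", "Abduction"], name="title"),
)

# sort_index returns a new, sorted frame; df itself keeps its original order
sorted_df = df.sort_index()

print(df.index.is_monotonic_increasing)         # original order
print(sorted_df.index.is_monotonic_increasing)  # sorted order
print(sorted_df.index.is_unique)                # the repeated title is still there
```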

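5. Our data is now set up and ready for lexicographical slicing. Let us select all the movie titles that fall between "Aa" and "Bb", that is, from the letter A into the letter B.

movies.loc["Aa":"Bb"]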

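title	budget	vote_average	vote_count
Abandon	25000000	4.6	45
Abandoned	0	5.8	27
Abduction	35000000	5.6	961
Aberdeen	0	7.0	6
About Last Night	12500000	6.0	210
...	...	...	...
Battle for the Planet of the Apes	1700000	5.5	215
Battle of the Year	20000000	5.9	88
Battle: Los Angeles	70000000	5.5	1448
Battlefield Earth	44000000	3.0	255
Battleship	209000000	5.5	2114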
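To see how the bounds behave, here is a minimal sketch on a hypothetical six-title frame (the `vote_count` values are placeholders). The bounds "Aa" and "Bb" do not need to exist in the index; on a sorted index they only mark where the slice starts and stops, and the comparison is an ordinary case-sensitive string comparison.

```python
import pandas as pd

# Hypothetical titles, only for illustration
titles = ["Battleship", "Abduction", "Cars", "Aberdeen", "Big Fish", "Avatar"]
df = pd.DataFrame(
    {"vote_count": range(len(titles))},
    index=pd.Index(titles, name="title"),
).sort_index()

# Keeps every label that sorts at or after "Aa" and at or before "Bb"
subset = df.loc["Aa":"Bb"]
print(list(subset.index))
```

"Battleship" is kept because "Ba" sorts before "Bb", while "Big Fish" is dropped because "Bi" sorts after it.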

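Next, sort the index in descending order and look at the first few rows.

movies.sort_index(ascending=False, inplace=True)
movies.head()

title	budget	vote_average	vote_count
Æon Flux	62000000	5.4	703
xXx: State of the Union	60000000	4.7	549
xXx	70000000	5.8	1424
eXistenZ	15000000	6.7	475
[REC]²	5600000	6.4	489

Now run the same slice again.

movies.loc["Aa":"Bb"]

title	budget	vote_average	vote_count
(empty DataFrame)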

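Because the data is now sorted in reverse order, it is no surprise to get an empty DataFrame. Let us reverse the letters and run it again.

movies.loc["Bb":"Aa"]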

title	budget	vote_average	vote_count
B-Girl	0	5.5	7
Ayurveda: Art of Being	300000	5.5	3
Away We Go	17000000	6.7	189
Awake	86000000	6.3	395
Avengers: Age of Ultron	280000000	7.3	6767
...	...	...	...
About Last Night	12500000	6.0	210
Aberdeen	0	7.0	6
Abduction	35000000	5.6	961
Abandoned	0	5.8	27
Abandon	25000000	4.6	45
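The effect of the sort direction can be checked on a hypothetical six-title frame (placeholder values again). With a descending index, the slice bounds have to follow the same direction as the index, otherwise the slice is empty.

```python
import pandas as pd

# Hypothetical titles, only for illustration
titles = ["Battleship", "Abduction", "Cars", "Aberdeen", "Big Fish", "Avatar"]
df = pd.DataFrame(
    {"vote_count": range(len(titles))},
    index=pd.Index(titles, name="title"),
).sort_index(ascending=False)

forward = df.loc["Aa":"Bb"]   # ascending bounds on a descending index
backward = df.loc["Bb":"Aa"]  # bounds follow the index direction

print(len(forward))
print(list(backward.index))
```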

Kiran P

Updated on: November 10, 2020
