Closed
Labels
API Design; Master Tracker (high level tracker for similar issues); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Description
originally #5190
xref #9816
xref #3942
This issue is for creating a unified API to Series & DataFrame sorting methods. Panels are not addressed (yet) but a unified API should be easy to extend to them. Related are #2094, #5190, #6847, #7121, #2615. As discussion proceeds, this post will be edited.
For reference, the 0.14.1 signatures are:
Series.sort(axis=0, ascending=True, kind='quicksort', na_position='last', inplace=True)
Series.sort_index(ascending=True)
Series.sortlevel(level=0, ascending=True, sort_remaining=True)
DataFrame.sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort',
               na_position='last')
DataFrame.sort_index(axis=0, by=None, ascending=True, inplace=False, kind='quicksort',
                     na_position='last')
DataFrame.sortlevel(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True)

Proposed unified signature for Series.sort and DataFrame.sort (except Series version retains current inplace=True):
def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False,
         kind='quicksort', na_position='last', sort_remaining=True):
    """Sort by labels (along either axis), by the values in column(s) or both.

    If both, labels take precedence over columns. If neither is specified,
    behavior is object-dependent: series = on values, dataframe = on index.

    Parameters
    ----------
    by : column name or list of column names
        if not None, sort on values in specified column name; perform nested
        sort if list of column names specified. this argument is ignored by series
    axis : {0, 1}
        sort index/rows (0) or columns (1); for Series, only 0 can be specified
    level : int or level name or list of ints or list of level names
        if not None, sort on values in specified index level(s)
    ascending : bool or list of bool
        Sort ascending vs. descending. Specify list for multiple sort orders.
    inplace : bool
        if True, perform operation in-place (without creating new instance)
    kind : {'quicksort', 'mergesort', 'heapsort'}
        Choice of sorting algorithm. See np.sort for more information.
        'mergesort' is the only stable algorithm. For data frames, this option is
        only applied when sorting on a single column or label.
    na_position : {'first', 'last'}
        'first' puts NaNs at the beginning, 'last' puts NaNs at the end
    sort_remaining : bool
        if True and sorting by level and index is multilevel, sort by other levels
        too (in order) after sorting by specified level
    """

The sort_index signatures change too and sort_columns is created:
Series.sort_index(level=0, ascending=True, inplace=False, kind='quicksort',
                  na_position='last', sort_remaining=True)
DataFrame.sort_index(level=0, axis=0, by=None, ascending=True, inplace=False,
                     kind='quicksort', na_position='last', sort_remaining=True)
# by is DEPRECATED, see change 7
DataFrame.sort_columns(by=None, level=0, ascending=True, inplace=False,
                       kind='quicksort', na_position='last', sort_remaining=True)
# or maybe level=None

Proposed changes:
1. make `inplace=False` the default (changes `Series.sort`); maybe, possibly in 1.0
2. new `by` argument to accept column-name/list-of-column-names in first position
   a. deprecate `columns` keyword of `DataFrame.sort`, replaced with `by` (df.sort signature would need to retain columns keyword until finally removed but it's not shown in proposal)
   b. don't allow tuples to access levels of multi-index (`columns` arg of `DataFrame.sort` allows tuples); use new `level` argument instead
   c. don't swap order of `by`/`axis` in `DataFrame.sort_index` (see change 7)
   d. this argument is ignored by series but `axis` is too so for the sake of working with dataframes, it gets first position
3. new `level` argument to accept integer/level-name/list-of-ints/list-of-level-names for sorting (multi)index by particular level(s)
   a. replaces tuple behavior of `columns` arg of `DataFrame.sort`
   b. add `level` argument to `sort_index` in first position so level(s) of multilevel index can be specified; this makes `sort_index` == `sortlevel` (see change 8)
   c. also adds `sort_remaining` arg to handle multi-level indexes
4. new method `DataFrame.sort_columns` == `sort(axis=1)` (see syntax below)
5. deprecate `Series.order` since change 1 makes `Series.sort` equivalent (?)
6. add `inplace`, `kind`, and `na_position` arguments to `Series.sort_index` (to match `DataFrame.sort_index`); `by` and `axis` args are not added since they don't make sense for series
7. deprecate and eventually remove `by` argument from `DataFrame.sort_index` since it makes `sort_index` equivalent to `sort`
8. deprecate `sortlevel` since change 3b makes `sort_index` equivalent
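The transition described for the `columns` keyword (it is retained until finally removed, while `by` becomes the preferred name) could be handled with a small shim. The following is a minimal sketch in plain Python, not pandas code: `sort_shim` is a hypothetical stand-in for the proposed `DataFrame.sort`, and the warning category, message, and the choice to reject both keywords at once are assumptions rather than part of the proposal.

```python
import warnings


def sort_shim(by=None, axis=0, level=None, ascending=True, columns=None):
    """Hypothetical stand-in for the proposed sort(); returns the resolved `by`.

    Only the keyword handling is modelled here, no sorting is performed.
    """
    if columns is not None:
        if by is not None:
            # Assumption: passing both spellings is treated as an error.
            raise TypeError("pass either 'by' or the deprecated 'columns', not both")
        # Assumption: FutureWarning is used to flag the deprecated keyword.
        warnings.warn(
            "the 'columns' keyword is deprecated; use 'by' instead",
            FutureWarning,
            stacklevel=2,
        )
        by = columns
    return by


# New-style call: no warning is emitted.
resolved = sort_shim(by=["A", "B"])
```

Old call sites such as `sort_shim(columns=['A', 'B'])` keep working during the deprecation window but emit a warning, which is the behaviour the "retain columns keyword until finally removed" remark seems to call for.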
Notes:
- default behavior of `sort` is still object-dependent: for series, sorts by values and for data frames, sorts by index
- new `level` arg makes `sort_index` and `sortlevel` equivalent. if sortlevel is retained:
  - should rename `sortlevel` to `sort_level` for naming conventions
  - `Series.sortlevel` should have `inplace` argument added
  - maybe don't add `level` and `sort_remaining` args to `sort_index` so it's not equivalent to `sort_level` (intentionally limiting sort_index seems like a bad idea though)
- it's unclear if default should be `level=None` for `sort_columns`. probably not since level=None falls back to level=0 anyway
- both `by` and `axis` arguments should be ignored by `Series.sort`
Syntax:
- dataframes:
  - `sort()` == `sort(level=0)` == `sort_index()` == `sortlevel()`
    - without columns or level specified, defaults to current behavior of sort on index
  - `sort(['A','B'])`
    - since columns are specified, default index sort should not occur; sorting only happens using columns 'A' and 'B'
  - `sort(level='spam')` == `sort_index('spam')` == `sortlevel('spam')`
    - sort occurs on row index named 'spam' or level of multi-index named 'spam'
  - `sort(['A','B'], level='spam')`
    - `level` controls here even though columns are specified so sort happens along row index named 'spam' first, then nested sort occurs using columns 'A' and 'B'
  - `sort(axis=1)` == `sort(axis=1, level=0)` == `sort_columns()`
    - since data frames default to sort on index, leaving level=None is the same as level=0
  - `sort(['A','B'], axis=1)` == `sort_columns(['A','B'])`
    - as with preceding example, level=None becomes level=0 in sort_columns
  - `sort(['A','B'], axis=1, level='spam')` == `sort_columns(['A','B'], level='spam')`
    - `axis` controls `level` so sort will be on columns named 'A' and 'B' in column index named 'spam'
- series:
  - `sort()` == `order()`: sorts on values
  - with `level` specified, sorts on index/named index/level of multi-index:
    - `sort(level=0)` == `sort_index()` == `sortlevel()`
    - `sort(level='spam')` == `sort_index('spam')` == `sortlevel('spam')`
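To make the precedence rules in the syntax list easier to check, here is a rough sketch that resolves which keys a call would sort on, and in what order. It is plain Python and deliberately performs no sorting; `resolve_sort_keys` and its `obj_kind` argument are hypothetical names invented for illustration, and only the `axis=0` cases are modelled.

```python
def resolve_sort_keys(obj_kind, by=None, level=None):
    """Return the ordered (source, key) pairs a proposed sort() call would use.

    Hypothetical helper: models only the axis=0 rules from the syntax list.
    """
    if obj_kind not in ("series", "dataframe"):
        raise ValueError("obj_kind must be 'series' or 'dataframe'")

    def as_list(value):
        # A scalar (including 0) becomes a one-element list; None means "not given".
        if value is None:
            return []
        return list(value) if isinstance(value, list) else [value]

    levels = [("level", lv) for lv in as_list(level)]

    if obj_kind == "series":
        # `by` is ignored for series; without `level`, sort on the values.
        return levels or [("values", None)]

    columns = [("column", col) for col in as_list(by)]
    # Labels take precedence over columns; with neither given, fall back to level 0.
    return (levels + columns) or [("level", 0)]


print(resolve_sort_keys("dataframe"))                   # [('level', 0)]
print(resolve_sort_keys("dataframe", by=["A", "B"]))    # [('column', 'A'), ('column', 'B')]
print(resolve_sort_keys("dataframe", by=["A", "B"], level="spam"))
print(resolve_sort_keys("series"))                      # [('values', None)]
```

The third call yields the 'spam' level first and then columns 'A' and 'B', matching the "`level` controls here" case above.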
Comments welcome.