Description
I've lately worked on making Series.map simpler as part of implementing the na_action on all ExtensionArray.map methods. As part of that, I made #52033. That PR (and the current SeriesApply.apply_standard more generally) very clearly shows how Series.apply & Series.map are very similar, but different enough for it to be confusing when it's a good idea to use one over the other and when Series.apply especially is a bad idea to use.
I propose some changes to how Series.apply works when given a single callable. This change is somewhat fundamental, so I understand that it may be controversial, but I believe it will be for the better for Pandas. I'm of course ready for discussion and possibly (but hopefully not 😄) disagreement. We'll see.
I'll show the proposal below. First I'll show what the similarities and differences are between the two methods, then what the problems are, in my view, with the current API, and then my proposed solution.
Similarities and differences between Series.apply and Series.map
The similarity between the methods is especially that they both fall back to Series._map_values, which in turn uses algorithms.map_array or ExtensionArray.map as relevant.
The differences are many, but each one is relatively minor:

- Series.apply has a convert_dtype parameter, which Series.map doesn't
- Series.map has a na_action parameter, which Series.apply doesn't
- Series.apply can take advantage of numpy ufuncs, which Series.map can't
- Series.apply can take args and **kwargs, which Series.map can't
- Series.apply will return a DataFrame if its result is a listlike of Series, which Series.map won't
- Series.apply is more general and can take a string, e.g. "sum", or lists or dicts of inputs, which Series.map can't
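A small sketch illustrating a few of the differences listed above (the `add` helper is just an example function, not anything from pandas):

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# args/**kwargs: apply forwards extra arguments to the callable; map can't
def add(x, amount=0):
    return x + amount

applied = ser.apply(add, amount=10)  # each element gets amount=10 added

# strings: apply accepts e.g. "sum" and aggregates; map doesn't accept strings
total = ser.apply("sum")

# na_action: map can skip over NA values; apply has no such parameter
mapped = pd.Series([1.0, None]).map(lambda x: x * 2, na_action="ignore")
```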
Also, Series.apply is a bit of a parent method of Series.agg & Series.transform.
The problems
The above similarities and many minor differences make for (IMO) confusing and overly complex rules for when it's a good idea to use .apply over .map, and vice versa. I will show some examples below.
First some setup:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> small_ser = pd.Series([1, 2, 3])
>>> large_ser = pd.Series(range(100_000))

1: string vs numpy funcs in Series.apply
>>> small_ser.apply("sum")
6
>>> small_ser.apply(np.sum)
0 1
1 2
2 3
dtype: int64

It will surprise new users that these two give different results. Also, anyone using the second pattern is probably making a mistake.
Note that giving np.sum to DataFrame.apply aggregates properly:
>>> small_ser.to_frame().apply(np.sum)
0 6
dtype: int64

1.5 Callables vs. list/dict of callables (added 2023-04-07)
>>> small_ser.apply(np.sum)
0 1
1 2
2 3
dtype: int64
>>> small_ser.apply([np.sum])
sum 6
dtype: int64

Also with non-numpy callables:
>>> small_ser.apply(lambda x: x.sum())
AttributeError: 'int' object has no attribute 'sum'
>>> small_ser.apply([lambda x: x.sum()])
<lambda> 6
dtype: int64

In both cases above the difference is that Series.apply operates element-wise if given a callable, but series-wise if given a list/dict of callables.
2. Functions in Series.apply (& Series.transform)
The Series.apply doc string has examples using lambdas, but lambdas in Series.apply are a bad practice because of bad performance:
>>> %timeit large_ser.apply(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop

Currently, Series does not have a method that makes a callable operate on a series' data. Instead users need to use Series.pipe for that operation in order for it to be efficient:
>>> %timeit large_ser.pipe(lambda x: x + 1)
44 µs ± 363 ns per loop

(The reason for the above performance difference is that apply gets called on each single element, while pipe calls x.__add__(1), which operates on the whole array.)
Note also that .pipe operates on the Series while apply currently operates on each element in the data, so there are some differences that may have consequences in some cases.
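The distinction can be made concrete with a small sketch (small data here, but the same shapes as the timings above):

```python
import pandas as pd

ser = pd.Series(range(5))

# apply: the lambda is called once per element, so x is a scalar
elementwise = ser.apply(lambda x: x + 1)

# pipe: the lambda is called once with the whole Series, so x + 1
# is a single vectorized operation on the underlying array
serieswise = ser.pipe(lambda x: x + 1)
```

Both produce the same Series; the difference is purely how many times the callable runs and on what.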
Also notice that Series.transform has the same performance problems:
>>> %timeit large_ser.transform(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop

3. ufuncs in Series.apply vs. in Series.map
Performance-wise, ufuncs are fine in Series.apply, but not in Series.map:
>>> %timeit large_ser.apply(np.sqrt)
71.6 µs ± 1.17 µs per loop
>>> %timeit large_ser.map(np.sqrt)
63.9 ms ± 69.5 µs per loop

It's difficult for users to understand why one is fast and the other slow (answer: only apply correctly works with ufuncs).
It is also difficult to understand why ufuncs are fast in apply, while other callables are slow in apply (answer: it's because ufuncs operate on the whole array, while other callables operate elementwise).
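A rough sketch of what happens under the hood in the two cases (this is an illustration, not the actual pandas internals):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 4.0, 9.0])

# a ufunc accepts a whole array, so apply can hand it the underlying
# array in one call -- roughly equivalent to:
fast = pd.Series(np.sqrt(ser.to_numpy()), index=ser.index)

# an ordinary Python callable only understands one value at a time,
# so the elements must be looped over in Python:
slow = pd.Series([np.sqrt(x) for x in ser.to_numpy()], index=ser.index)
```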
4. callables in Series.apply are bad, callables in SeriesGroupby.apply are fine
I showed above that using (non-ufunc) callables in Series.apply is bad performance-wise. OTOH, using them in SeriesGroupby.apply is fine:
>>> %timeit large_ser.apply(lambda x: x + 1)
24.3 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit large_ser.groupby(large_ser > 50_000).apply(lambda x: x + 1)
11.3 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that most of the time in the groupby version was spent doing groupby ops, so the actual difference in the apply op is much larger, similar to example 2 above.
Having callables be OK to use in the SeriesGroupby.apply method, but not in Series.apply, is confusing IMO.
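The underlying reason is what the callable receives: a sketch recording the argument type shows that Series.apply hands the function scalars, while SeriesGroupby.apply hands it whole Series (one per group), so the `x + 1` inside the groupby version is vectorized:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4])
seen = []

def record(x):
    # record whether we were handed a scalar or a whole Series
    seen.append(isinstance(x, pd.Series))
    return x

ser.apply(record)                   # called once per element, with scalars
ser.groupby(ser > 2).apply(record)  # called once per group, with a Series
```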
5: callables in Series.apply that return Series transform data to a DataFrame
Series.apply has an exception: if the callable returns a list-like of Series, the Series will be concatenated into a DataFrame. This is a very slow operation and hence generally a bad idea:
>>> small_ser.apply(lambda x: pd.Series([x, x+1], index=["a", "b"]))
   a  b
0  1  2
1  2  3
2  3  4
>>> %timeit large_ser.apply(lambda x: pd.Series([x, x+1]))
# timing takes too long to measure

It's probably never a good idea to use this pattern; e.g. .pipe is much faster, so large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x+1})) will be much faster. If we really do need to operate on single elements in that fashion, it is still possible using pipe, e.g. large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x.map(some_func)})) and also just directly pd.DataFrame({"a": large_ser, "b": large_ser.map(some_func)}).
So giving callables that return Series to Series.apply is a bad pattern and should be discouraged. (If users really want to do that pattern, they should build the list of Series themselves and take responsibility for the slowdown.)
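A small sketch of the faster alternatives (`some_func` is just a stand-in for an arbitrary element-wise function, not anything from pandas):

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

def some_func(x):  # stand-in for a genuinely element-wise function
    return x * 10

# build the columns with whole-array ops instead of returning a
# Series per element from apply:
fast = pd.DataFrame({"a": ser, "b": ser + 1})

# if one column really does need an element-wise function, confine
# map to that single column:
mixed = pd.DataFrame({"a": ser, "b": ser.map(some_func)})
```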
6. Series.apply vs. Series.agg
The doc string for Series.agg says about the method's func parameter: "If a function, must ... work when passed ... to Series.apply". But compare these:
>>> small_ser.apply(np.sum)
0 1
1 2
2 3
dtype: int64
>>> small_ser.agg(np.sum)
6

You could argue the doc string is correct (it doesn't raise...), but you could also argue it isn't (because the results are different). I'd personally expect "must work when passed to Series.apply" to mean "gives the same result when passed to agg and to apply".
7. dictlikes vs. listlikes in Series.apply (added 2023-06-04)
Giving a list of transforming arguments to Series.apply returns a DataFrame:
>>> small_ser.apply(["sqrt", np.abs])
sqrt absolute
0 1.000000 1
1 1.414214 2
2  1.732051         3

But giving a dict of transforming arguments returns a Series with a MultiIndex:
>>> small_ser.apply({"sqrt": "sqrt", "abs": np.abs})
sqrt 0 1.000000
1 1.414214
2 1.732051
abs 0 1.000000
1 2.000000
2 3.000000
dtype: float64

These two should give same-shaped output for consistency. Using Series.transform instead of Series.apply returns a DataFrame in both cases, and I think the dictlike example above should return a DataFrame similar to the listlike example.
Minor additional info: listlikes and dictlikes of aggregation arguments do behave the same, so this is only a problem with dictlikes of transforming arguments when using apply.
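For comparison, a sketch of the consistent behavior that Series.transform already has for the two input shapes:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3])

# Series.transform returns a DataFrame for both a list and a dict of funcs
from_list = ser.transform([np.sqrt, np.abs])
from_dict = ser.transform({"sqrt": np.sqrt, "abs": np.abs})
```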
Proposal
With the above in mind, I propose that:
- Series.apply takes callables that always operate on the series. I.e. let series.apply(func) be similar to func(series) + the needed additional functionality.
- Series.map takes callables that operate on each element individually. I.e. series.map(func) will be similar to the current series._map_values(func) + the needed additional functionality.
- The parameter convert_dtype will be deprecated in Series.apply (already done in DEPR: Deprecate the convert_dtype param in Series.apply #52257).
- A parameter convert_dtype will NOT be added to Series.map (comment by @rhshadrach).
- The ability in Series.apply to convert a list[Series] to a DataFrame will be deprecated (already done in DEPR: Deprecate returning a DataFrame in SeriesApply.apply_standard #52123).
- The ability to convert a list[Series] to a DataFrame will NOT be added to Series.map.
- The changes made to Series.apply will propagate to Series.agg and Series.transform.
The difference between Series.apply() & Series.map() will then be that:

- Series.apply() makes the passed-in callable operate on the series, similarly to how (DataFrame|SeriesGroupby|DataFrameGroupBy).apply operates on series. This is very fast and can do almost anything.
- Series.map() makes the passed-in callable operate on each of the series' data elements individually. This is very flexible, but can be very slow, so should only be used if Series.apply can't do it.
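A hypothetical sketch of the proposed semantics — the helper functions below are illustrations only, not pandas API:

```python
import pandas as pd

def proposed_apply(series, func, *args, **kwargs):
    # proposal: the callable operates on the whole Series, ~ func(series)
    return func(series, *args, **kwargs)

def proposed_map(series, func):
    # proposal: the callable operates on each element individually
    return pd.Series([func(x) for x in series], index=series.index)

ser = pd.Series([1, 2, 3])
as_apply = proposed_apply(ser, lambda s: s + 1)  # vectorized, fast
as_map = proposed_map(ser, lambda x: x + 1)      # element-wise, flexible
```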
So, IMO, this API change will help make the Pandas Series.(apply|map) API simpler without losing functionality, and will let their functionality be explainable in a simple manner, which would be a win for Pandas.
Deprecation process
The cumbersome part of the deprecation process will be to change Series.apply to only work array-wise, i.e. to always do func(series._values). This can be done by adding an array_ops_only parameter to Series.apply, so:
>>> def apply(self, ..., array_ops_only: bool | NoDefault = no_default, ...):
...     if array_ops_only is no_default:
...         warn("....")
...         array_ops_only = False
...     ...

and then change the meaning of that parameter in pandas v3.0 again to make people remove it from their code.
The other changes are easier: convert_dtype in Series.apply will be deprecated just as method parameters normally are. The ability to convert a list of Series to a DataFrame will emit a deprecation warning when that code path is encountered.