Description
I've lately worked on making Series.map simpler as part of implementing the na_action on all ExtensionArray.map methods. As part of that, I made #52033. That PR (and the current SeriesApply.apply_standard more generally) very clearly shows how Series.apply & Series.map are very similar, but different enough for it to be confusing when it's a good idea to use one over the other and when Series.apply especially is a bad idea to use.
I propose some changes to how Series.apply works when given a single callable. This change is somewhat fundamental, so I understand that it may be controversial, but I believe it will be for the better for Pandas. I'm of course ready for discussion and possibly (but hopefully not 😄) disagreement. We'll see.
I'll show the proposal below. First I'll show what the similarities and differences are between the two methods, then what the problems are, in my view, with the current API, and then my proposed solution.
Similarities and differences between Series.apply and Series.map
The similarity between the methods is especially that they both fall back to Series._map_values, which in turn uses algorithms.map_array or ExtensionArray.map as relevant.
The differences are many, but each one is relatively minor:

- Series.apply has a convert_dtype parameter, which Series.map doesn't
- Series.map has a na_action parameter, which Series.apply doesn't
- Series.apply can take advantage of numpy ufuncs, which Series.map can't
- Series.apply can take args and **kwargs, which Series.map can't
- Series.apply will return a DataFrame if its result is a listlike of Series, which Series.map won't
- Series.apply is more general and can take a string, e.g. "sum", or lists or dicts of inputs, which Series.map can't
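A small sketch illustrating a few of the differences listed above (the `add` helper is just an example function, not anything from pandas):

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# args/**kwargs: apply forwards extra arguments to the callable; map can't
def add(x, amount=0):
    return x + amount

applied = ser.apply(add, amount=10)  # each element gets amount=10 added

# strings: apply accepts e.g. "sum" and aggregates; map doesn't accept strings
total = ser.apply("sum")

# na_action: map can skip over NA values; apply has no such parameter
mapped = pd.Series([1.0, None]).map(lambda x: x * 2, na_action="ignore")
```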
Also, Series.apply is a bit of a parent method of Series.agg & Series.transform.
The problems
The above similarities and many minor differences make for (IMO) confusing and overly complex rules for when it's a good idea to use .apply over .map, and vice versa. I will show some examples below.
First some setup:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> small_ser = pd.Series([1, 2, 3])
>>> large_ser = pd.Series(range(100_000))

1: string vs numpy funcs in Series.apply
>>> small_ser.apply("sum")
6
>>> small_ser.apply(np.sum)
0 1
1 2
2 3
dtype: int64

It will surprise new users that these two give different results. Also, anyone using the second pattern is probably making a mistake.
Note that giving np.sum to DataFrame.apply aggregates properly:
>>> small_ser.to_frame().apply(np.sum)
0 6
dtype: int64

1.5 Callables vs. list/dict of callables (added 2023-04-07)
>>> small_ser.apply(np.sum)
0 1
1 2
2 3
dtype: int64
>>> small_ser.apply([np.sum])
sum 6
dtype: int64

Also with non-numpy callables:
>>> small_ser.apply(lambda x: x.sum())
AttributeError: 'int' object has no attribute 'sum'
>>> small_ser.apply([lambda x: x.sum()])
<lambda> 6
dtype: int64

In both cases above the difference is that Series.apply operates element-wise if given a callable, but series-wise if given a list/dict of callables.
2. Functions in Series.apply (& Series.transform)
The Series.apply doc string has examples using lambdas, but lambdas in Series.apply are a bad practice because of bad performance:
>>> %timeit large_ser.apply(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop

Currently, Series does not have a method that makes a callable operate on a series' data. Instead users need to use Series.pipe for that operation in order for it to be efficient:
>>> %timeit large_ser.pipe(lambda x: x + 1)
44 µs ± 363 ns per loop

(The reason for the above performance difference is that apply gets called on each single element, while pipe calls x.__add__(1), which operates on the whole array.)
Note also that .pipe operates on the Series while apply currently operates on each element in the data, so there are some differences that may have consequences in some cases.
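The distinction can be made concrete with a small sketch (small data here, but the same shapes as the timings above):

```python
import pandas as pd

ser = pd.Series(range(5))

# apply: the lambda is called once per element, so x is a scalar
elementwise = ser.apply(lambda x: x + 1)

# pipe: the lambda is called once with the whole Series, so x + 1
# is a single vectorized operation on the underlying array
serieswise = ser.pipe(lambda x: x + 1)
```

Both produce the same Series; the difference is purely how many times the callable runs and on what.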
Also notice that Series.transform has the same performance problems:
>>> %timeit large_ser.transform(lambda x: x + 1)
24.1 ms ± 88.8 µs per loop

3. ufuncs in Series.apply vs. in Series.map
Performance-wise, ufuncs are fine in Series.apply, but not in Series.map:
>>> %timeit large_ser.apply(np.sqrt)
71.6 µs ± 1.17 µs per loop
>>> %timeit large_ser.map(np.sqrt)
63.9 ms ± 69.5 µs per loop

It's difficult for users to understand why one is fast and the other slow (answer: only apply correctly works with ufuncs).
It is also difficult to understand why ufuncs are fast in apply, while other callables are slow in apply (answer: it's because ufuncs operate on the whole array, while other callables operate elementwise).
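A rough sketch of what happens under the hood in the two cases (this is an illustration, not the actual pandas internals):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 4.0, 9.0])

# a ufunc accepts a whole array, so apply can hand it the underlying
# array in one call -- roughly equivalent to:
fast = pd.Series(np.sqrt(ser.to_numpy()), index=ser.index)

# an ordinary Python callable only understands one value at a time,
# so the elements must be looped over in Python:
slow = pd.Series([np.sqrt(x) for x in ser.to_numpy()], index=ser.index)
```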
4. callables in Series.apply are bad, callables in SeriesGroupby.apply are fine
I showed above that using (non-ufunc) callables in Series.apply is bad performance-wise. OTOH, using them in SeriesGroupby.apply is fine:
>>> %timeit large_ser.apply(lambda x: x + 1)
24.3 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit large_ser.groupby(large_ser > 50_000).apply(lambda x: x + 1)
11.3 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that most of the time in the groupby version was spent doing groupby ops, so the actual difference in the apply op is much larger, similar to example 2 above.
Having callables be OK to use in the SeriesGroupby.apply method, but not in Series.apply, is confusing IMO.
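The underlying reason is what the callable receives: a sketch recording the argument type shows that Series.apply hands the function scalars, while SeriesGroupby.apply hands it whole Series (one per group), so the `x + 1` inside the groupby version is vectorized:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4])
seen = []

def record(x):
    # record whether we were handed a scalar or a whole Series
    seen.append(isinstance(x, pd.Series))
    return x

ser.apply(record)                   # called once per element, with scalars
ser.groupby(ser > 2).apply(record)  # called once per group, with a Series
```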
5: callables in Series.apply that return Series transform data to a DataFrame
Series.apply has an exception: if the callable returns a list-like of Series, the Series will be concatenated into a DataFrame. This is a very slow operation and hence generally a bad idea:
>>> small_ser.apply(lambda x: pd.Series([x, x+1], index=["a", "b"]))
   a  b
0  1  2
1  2  3
2  3  4
>>> %timeit large_ser.apply(lambda x: pd.Series([x, x+1]))
# timing takes too long to measure

It's probably never a good idea to use this pattern; e.g. .pipe is much faster, so large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x+1})) will be much faster. If we really do need to operate on single elements in that fashion, it is still possible using pipe, e.g. large_ser.pipe(lambda x: pd.DataFrame({"a": x, "b": x.map(some_func)})) and also just directly pd.DataFrame({"a": large_ser, "b": large_ser.map(some_func)}).
So giving callables that return Series to Series.apply is a bad pattern and should be discouraged. (If users really want to do that pattern, they should build the list of Series themselves and take responsibility for the slowdown.)
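A small sketch of the faster alternatives (`some_func` is just a stand-in for an arbitrary element-wise function, not anything from pandas):

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

def some_func(x):  # stand-in for a genuinely element-wise function
    return x * 10

# build the columns with whole-array ops instead of returning a
# Series per element from apply:
fast = pd.DataFrame({"a": ser, "b": ser + 1})

# if one column really does need an element-wise function, confine
# map to that single column:
mixed = pd.DataFrame({"a": ser, "b": ser.map(some_func)})
```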
6. Series.apply vs. Series.agg
The doc string for Series.agg says about the method's func parameter: "If a function, must ... work when passed ... to Series.apply". But compare these:
>>> small_ser.apply(np.sum)
0 1
1 2
2 3
dtype: int64
>>> small_ser.agg(np.sum)
6

You could argue the doc string is correct (it doesn't raise...), but you could also argue it isn't (because the results are different). I'd personally expect "must work when passed to Series.apply" to mean "gives the same result when passed to agg and to apply".
7. dictlikes vs. listlikes in Series.apply (added 2023-06-04)
Giving a list of transforming arguments to Series.apply returns a DataFrame:
>>> small_ser.apply(["sqrt", np.abs])
sqrt absolute
0 1.000000 1
1 1.414214 2
2  1.732051         3

But giving a dict of transforming arguments returns a Series with a MultiIndex:
>>> small_ser.apply({"sqrt": "sqrt", "abs": np.abs})
sqrt 0 1.000000
1 1.414214
2 1.732051
abs 0 1.000000
1 2.000000
2 3.000000
dtype: float64

These two should give same-shaped output for consistency. Using Series.transform instead of Series.apply returns a DataFrame in both cases, and I think the dictlike example above should return a DataFrame similar to the listlike example.
Minor additional info: listlikes and dictlikes of aggregation arguments do behave the same, so this is only a problem with dictlikes of transforming arguments when using apply.
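For comparison, a sketch of the consistent behavior that Series.transform already has for the two input shapes:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3])

# Series.transform returns a DataFrame for both a list and a dict of funcs
from_list = ser.transform([np.sqrt, np.abs])
from_dict = ser.transform({"sqrt": np.sqrt, "abs": np.abs})
```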
Proposal
With the above in mind, I propose that:
- Series.apply takes callables that always operate on the series. I.e. let series.apply(func) be similar to func(series) + the needed additional functionality.
- Series.map takes callables that operate on each element individually. I.e. series.map(func) will be similar to the current series._map_values(func) + the needed additional functionality.
- The parameter convert_dtype will be deprecated in Series.apply (already done in DEPR: Deprecate the convert_dtype param in Series.apply #52257).
- A parameter convert_dtype will NOT be added to Series.map (comment by @rhshadrach).
- The ability in Series.apply to convert a list[Series] to a DataFrame will be deprecated (already done in DEPR: Deprecate returning a DataFrame in SeriesApply.apply_standard #52123).
- The ability to convert a list[Series] to a DataFrame will NOT be added to Series.map.
- The changes made to Series.apply will propagate to Series.agg and Series.transform.
The difference between Series.apply() & Series.map() will then be that:

- Series.apply() makes the passed-in callable operate on the series, similarly to how (DataFrame|SeriesGroupby|DataFrameGroupBy).apply operates on series. This is very fast and can do almost anything.
- Series.map() makes the passed-in callable operate on each of the series' data elements individually. This is very flexible, but can be very slow, so should only be used if Series.apply can't do it.
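A hypothetical sketch of the proposed semantics — the helper functions below are illustrations only, not pandas API:

```python
import pandas as pd

def proposed_apply(series, func, *args, **kwargs):
    # proposal: the callable operates on the whole Series, ~ func(series)
    return func(series, *args, **kwargs)

def proposed_map(series, func):
    # proposal: the callable operates on each element individually
    return pd.Series([func(x) for x in series], index=series.index)

ser = pd.Series([1, 2, 3])
as_apply = proposed_apply(ser, lambda s: s + 1)  # vectorized, fast
as_map = proposed_map(ser, lambda x: x + 1)      # element-wise, flexible
```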
So, IMO, this API change will help make the Pandas Series.(apply|map) API simpler without losing functionality, and will let their functionality be explainable in a simple manner, which would be a win for Pandas.
Deprecation process
The cumbersome part of the deprecation process will be to change Series.apply to only work array-wise, i.e. to always do func(series._values). This can be done by adding an array_ops_only parameter to Series.apply, so:
>>> def apply(self, ..., array_ops_only: bool | NoDefault = no_default, ...):
...     if array_ops_only is no_default:
...         warn("....")
...         array_ops_only = False
...     ...

and then change the meaning of that parameter in pandas v3.0 again to make people remove it from their code.
The other changes are easier: convert_dtype in Series.apply will be deprecated just as method parameters normally are. The ability to convert a list of Series to a DataFrame will emit a deprecation warning when that code path is encountered.