Description
Problem
While working on the PR for #15486, I found that type validation for the fill_value parameter, which appears in a large number of pandas API methods, is done ad hoc. As a result, the set of accepted inputs varies widely from method to method. I think it would be good to standardize this so that all of these methods share one behavior: the one currently used by fillna.
Implementation Details
Part of the point of providing a fill_value is to avoid the slow type conversion that would otherwise be needed (using .fillna().astype()). Accepting fill values of other types is nevertheless a useful convenience. Implementation would roughly be:
Before executing the rest of the method body, check whether or not the fill_value is valid (using a centralized maybe_fill method). If it is not, raise a ValueError. If it is, check whether or not incorporating the fill_value would result in an upcast of the column dtype. If it would not, follow a code path where the column never gets type-converted. If it would, follow that same code path, then do something like a fillna operation at the end before returning.
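The flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: plain NumPy arrays stand in for pandas columns, the helper names validate_fill_value, result_dtype_for_fill, and fill_missing are hypothetical (none of them exist in pandas), and the upcast case is simplified to a single conversion before filling rather than a trailing fillna.

```python
import numpy as np


def validate_fill_value(fill_value):
    # Hypothetical stand-in for the centralized check; not a pandas function.
    # Lists, dicts, arbitrary objects, and other non-scalars are rejected up front.
    if not np.isscalar(fill_value):
        raise ValueError(f"invalid fill_value: {fill_value!r}")


def result_dtype_for_fill(dtype, fill_value):
    # dtype the column would need in order to hold fill_value.
    # Assumption of this sketch: string fills are treated as object fills.
    if isinstance(fill_value, str):
        fill_dtype = np.dtype(object)
    else:
        fill_dtype = np.asarray(fill_value).dtype
    return np.promote_types(np.dtype(dtype), fill_dtype)


def fill_missing(values, missing, fill_value):
    # values: 1-D array; missing: boolean mask, True where a value is absent.
    validate_fill_value(fill_value)
    values = np.asarray(values)
    missing = np.asarray(missing, dtype=bool)
    target = result_dtype_for_fill(values.dtype, fill_value)
    if target == values.dtype:
        out = values.copy()          # no upcast: dtype is never converted
    else:
        out = values.astype(target)  # upcast exactly once, then fill
    out[missing] = fill_value
    return out


ints = np.array([1, 2, 3])
mask = np.array([False, True, False])
print(fill_missing(ints, mask, 0).dtype == ints.dtype)  # int fill keeps the dtype
print(fill_missing(ints, mask, 0.5).dtype)              # float fill upcasts
```

The check compares dtypes only, so a small integer fill into a narrower integer column would be reported as an upcast; a real implementation would likely want a value-aware test.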
Target Implementation
The same as what fillna currently does, which is as follows.
Invalid:
- categorical fill for a category not in the categories will raise a ValueError.
- sparse matrices refuse upcasting.
- Passing an object or list or other non-coercible "thing" as a fill.
Valid, upcast:
- int fill will promote bool dtypes to int.
- float fill will promote int and bool dtypes to float (this is what happens with np.nan already).
- object (str) fill would promote lesser dtypes to object.
- int, float, and bool fill to a datetime dtype will be treated as a UNIX-like timestamp and promoted to datetime.
- object fill will promote datetime dtype to object.
Valid, no-cast:
- Everything else.
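As a rough cross-check, the numeric and object upcasts listed above line up with NumPy's own dtype promotion, which can be queried directly. The snippet below is a sketch that covers only those rules; the categorical, sparse, and timestamp-coercion cases are pandas-specific and are not modelled here.

```python
import numpy as np

b = np.dtype(bool)
i = np.dtype("int64")
f = np.dtype("float64")
o = np.dtype(object)
dt = np.dtype("datetime64[ns]")

# (column dtype, fill dtype) pairs taken from the "Valid, upcast" list
cases = {
    "int fill on bool": (b, i),
    "float fill on int": (i, f),
    "float fill on bool": (b, f),
    "object fill on float": (f, o),
    "object fill on datetime": (dt, o),
}

promoted = {name: np.promote_types(col, fill) for name, (col, fill) in cases.items()}
for name, dtype in promoted.items():
    print(f"{name}: {dtype}")
```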
Current Implementation
...is ad-hoc. The following are the methods which currently provide a fill_value input, as well as where they deviate from the model above.
- Series.combine, DataFrame.combine, Series.to_sparse: These are unique usages of fill_value which aren't compatible with the rest of them.
- Series.unstack, DataFrame.unstack: any fill_value is allowed. You can pass an object if you'd like, or even another DataFrame (yo dawg...).
- DataFrame.align: Any fill_value is allowed.
- DataFrame.reindex_axis: Lists and dicts are allowed, objects are not.
- DataFrame.asfreq, Series.asfreq: any fill_value is allowed.
- pd.pivot_table: ...
- Series.add, DataFrame.add: ...
- Series.subtract, DataFrame.subtract: ...
- Probably others, there are a lot of these.