Closed
In my notebook comparing dplyr and pandas, I gained a new level of appreciation for the ability to chain strings of operations together. In my own code, the biggest impediment to this is adding additional columns that are calculations on existing columns. For example:
# R / dplyr
mutate(flights,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60)
# ... calculation involving these

vs.

flights['gain'] = flights.arr_delay - flights.dep_delay
flights['speed'] = flights.distance / flights.air_time * 60
# ... calculation involving these later

just doesn't flow as nicely, especially if this mutate is in the middle of a chain.
I'd propose a new method (perhaps stealing mutate) that's similar to dplyr's.
The function signature could be kwarg only, where the keywords are the new column names. e.g.
flights.mutate(gain=flights.arr_delay - flights.dep_delay)

This would return a DataFrame with the new column gain in addition to the original columns.
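
To make the intended behaviour concrete, here is a minimal sketch written as a standalone function rather than a DataFrame method. It is not a proposed implementation: `mutate` is a hypothetical helper, and the tiny `flights` frame is made-up data for illustration only.

```python
import pandas as pd


def mutate(df, **kwargs):
    """Hypothetical helper: return a copy of df with one new column per keyword."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for name, values in kwargs.items():
        out[name] = values  # keyword name becomes the new column name
    return out


# Made-up rows standing in for the real flights data.
flights = pd.DataFrame({
    "arr_delay": [11.0, 20.0, 33.0],
    "dep_delay": [2.0, 4.0, 2.0],
})

result = mutate(flights, gain=flights.arr_delay - flights.dep_delay)
print(result.columns.tolist())
```

Because the function returns a new DataFrame instead of modifying `flights` in place, the result can be passed straight on to the next step of a chain.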
Worked out example
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
(iris.query('sepal_length > 4.5')
     .mutate(ratio=iris.sepal_length / iris.sepal_width)  # new part
     .groupby(pd.cut(iris.ratio)).mean()
)

Thoughts?