| 20 | 20 | Aggregation | 
| 21 | 21 | ============ | 
| 22 | 22 | 
 | 
| 23 |  | -An aggregate or aggregation is a function where the values of multiple rows are processed together to form a single summary value. | 
| 24 |  | -For performing an aggregation, DataFusion provides the :py:func:`~datafusion.dataframe.DataFrame.aggregate` | 
|  | 23 | +An aggregate or aggregation is a function where the values of multiple rows are processed together | 
|  | 24 | +to form a single summary value. To perform an aggregation, DataFusion provides the | 
|  | 25 | +:py:func:`~datafusion.dataframe.DataFrame.aggregate` method. | 
| 25 | 26 | 
 | 
| 26 | 27 | .. ipython:: python | 
| 27 | 28 | 
 | 
|  | 29 | +    import urllib.request | 
| 28 | 30 |     from datafusion import SessionContext | 
| 29 |  | -    from datafusion import column, lit | 
|  | 31 | +    from datafusion import col, lit | 
| 30 | 32 |     from datafusion import functions as f | 
| 31 |  | -    import random | 
| 32 | 33 | 
 | 
| 33 |  | -    ctx = SessionContext() | 
| 34 |  | -    df = ctx.from_pydict( | 
| 35 |  | -        { | 
| 36 |  | -            "a": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], | 
| 37 |  | -            "b": ["one", "one", "two", "three", "two", "two", "one", "three"], | 
| 38 |  | -            "c": [random.randint(0, 100) for _ in range(8)], | 
| 39 |  | -            "d": [random.random() for _ in range(8)], | 
| 40 |  | -        }, | 
| 41 |  | -        name="foo_bar" | 
|  | 34 | +    urllib.request.urlretrieve( | 
|  | 35 | +        "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv", | 
|  | 36 | +        "pokemon.csv", | 
| 42 | 37 |     ) | 
| 43 | 38 | 
 | 
| 44 |  | -    col_a = column("a") | 
| 45 |  | -    col_b = column("b") | 
| 46 |  | -    col_c = column("c") | 
| 47 |  | -    col_d = column("d") | 
|  | 39 | +    ctx = SessionContext() | 
|  | 40 | +    df = ctx.read_csv("pokemon.csv") | 
|  | 41 | +
 | 
|  | 42 | +    col_type_1 = col('"Type 1"') | 
|  | 43 | +    col_type_2 = col('"Type 2"') | 
|  | 44 | +    col_speed = col('"Speed"') | 
|  | 45 | +    col_attack = col('"Attack"') | 
| 48 | 46 | 
 | 
| 49 |  | -    df.aggregate([], [f.approx_distinct(col_c), f.approx_median(col_d), f.approx_percentile_cont(col_d, lit(0.5))]) | 
|  | 47 | +    df.aggregate([col_type_1], [ | 
|  | 48 | +        f.approx_distinct(col_speed).alias("Count"), | 
|  | 49 | +        f.approx_median(col_speed).alias("Median Speed"), | 
|  | 50 | +        f.approx_percentile_cont(col_speed, 0.9).alias("90% Speed")]) | 
| 50 | 51 | 
 | 
| 51 |  | -When the :code:`group_by` list is empty the aggregation is done over the whole :class:`.DataFrame`. For grouping | 
| 52 |  | -the :code:`group_by` list must contain at least one column | 
|  | 52 | +When the :code:`group_by` list is empty, the aggregation is done over the whole :class:`.DataFrame`. | 
|  | 53 | +For grouping, the :code:`group_by` list must contain at least one column. | 
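|  |  | +
|  |  | +For example, an empty ``group_by`` list collapses the entire DataFrame into a single summary row. | 
|  |  | +This is a minimal sketch; the aggregate expressions chosen here are only illustrative. | 
|  |  | +
|  |  | +.. ipython:: python | 
|  |  | +
|  |  | +    df.aggregate([], [f.avg(col_speed).alias("Avg Speed"), f.max(col_attack).alias("Max Attack")]) | 
|  |  | +
|  |  | +Grouping by ``Type 1`` then produces one summary row per type: | 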
| 53 | 54 | 
 | 
| 54 | 55 | .. ipython:: python | 
| 55 | 56 | 
 | 
| 56 |  | -    df.aggregate([col_a], [f.sum(col_c), f.max(col_d), f.min(col_d)]) | 
|  | 57 | +    df.aggregate([col_type_1], [ | 
|  | 58 | +        f.max(col_speed).alias("Max Speed"), | 
|  | 59 | +        f.avg(col_speed).alias("Avg Speed"), | 
|  | 60 | +        f.min(col_speed).alias("Min Speed")]) | 
| 57 | 61 | 
 | 
| 58 | 62 | More than one column can be used for grouping | 
| 59 | 63 | 
 | 
| 60 | 64 | .. ipython:: python | 
| 61 | 65 | 
 | 
| 62 |  | -    df.aggregate([col_a, col_b], [f.sum(col_c), f.max(col_d), f.min(col_d)]) | 
|  | 66 | +    df.aggregate([col_type_1, col_type_2], [ | 
|  | 67 | +        f.max(col_speed).alias("Max Speed"), | 
|  | 68 | +        f.avg(col_speed).alias("Avg Speed"), | 
|  | 69 | +        f.min(col_speed).alias("Min Speed")]) | 
|  | 70 | +
 | 
|  | 71 | +
 | 
|  | 72 | +
 | 
|  | 73 | +Setting Parameters | 
|  | 74 | +------------------ | 
|  | 75 | + | 
|  | 76 | +Each of the built-in aggregate functions provides arguments for the parameters that affect its | 
|  | 77 | +operation. These can also be set using the builder approach for any of the following | 
|  | 78 | +parameters. When you use the builder, you must call ``build()`` to finish. For example, these two | 
|  | 79 | +expressions are equivalent. | 
|  | 80 | + | 
|  | 81 | +.. ipython:: python | 
|  | 82 | +
 | 
|  | 83 | +    first_1 = f.first_value(col("a"), order_by=[col("a")]) | 
|  | 84 | +    first_2 = f.first_value(col("a")).order_by(col("a")).build() | 
|  | 85 | +
 | 
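|  |  | +The builder methods mirror the keyword arguments, so several parameters can be chained on one | 
|  |  | +expression. The following sketch assumes the builder exposes ``distinct`` and ``filter`` methods | 
|  |  | +that match the keyword arguments shown later on this page: | 
|  |  | +
|  |  | +.. ipython:: python | 
|  |  | +
|  |  | +    # keyword-argument form | 
|  |  | +    agg_1 = f.array_agg(col('"Type 2"'), distinct=True, filter=col('"Type 2"').is_not_null()) | 
|  |  | +    # assumed equivalent builder form: chain the parameters, then finish with build() | 
|  |  | +    agg_2 = f.array_agg(col('"Type 2"')).distinct().filter(col('"Type 2"').is_not_null()).build() | 
|  |  | +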
|  | 86 | +Ordering | 
|  | 87 | +^^^^^^^^ | 
|  | 88 | + | 
|  | 89 | +You can control the order in which rows are processed by an aggregate function by providing | 
|  | 90 | +a list of sort expressions for the ``order_by`` parameter. In the following example, we | 
|  | 91 | +sort the Pokemon by their attack values in increasing order and take the first value, which gives us the | 
|  | 92 | +Pokemon with the smallest attack value in each ``Type 1``. | 
|  | 93 | + | 
|  | 94 | +.. ipython:: python | 
|  | 95 | +
 | 
|  | 96 | +    df.aggregate( | 
|  | 97 | +        [col('"Type 1"')], | 
|  | 98 | +        [f.first_value( | 
|  | 99 | +            col('"Name"'), | 
|  | 100 | +            order_by=[col('"Attack"').sort(ascending=True)] | 
|  | 101 | +            ).alias("Smallest Attack") | 
|  | 102 | +        ]) | 
|  | 103 | +
 | 
|  | 104 | +Distinct | 
|  | 105 | +^^^^^^^^ | 
|  | 106 | + | 
|  | 107 | +When you set the parameter ``distinct`` to ``True``, each unique value is evaluated only one | 
|  | 108 | +time. Suppose we want to create an array of all of the ``Type 2`` values for each ``Type 1`` in our | 
|  | 109 | +Pokemon set. Since there will be many repeated entries of ``Type 2``, we want only one of each distinct value. | 
|  | 110 | + | 
|  | 111 | +.. ipython:: python | 
|  | 112 | +
 | 
|  | 113 | +    df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type 2 List")]) | 
|  | 114 | +
 | 
|  | 115 | +In the output above we can see that there are some ``Type 1`` values for which the ``Type 2`` entry | 
|  | 116 | +is ``null``. In practice, we probably want to filter those out. We can do this in two ways. First, | 
|  | 117 | +we can filter out the DataFrame rows that have no ``Type 2``. If we do this, we might remove some | 
|  | 118 | +``Type 1`` entries entirely. Second, we can use the ``filter`` argument described below. | 
|  | 119 | + | 
|  | 120 | +.. ipython:: python | 
|  | 121 | +
 | 
|  | 122 | +    df.filter(col_type_2.is_not_null()).aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type 2 List")]) | 
|  | 123 | +
 | 
|  | 124 | +    df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True, filter=col_type_2.is_not_null()).alias("Type 2 List")]) | 
|  | 125 | +
 | 
|  | 126 | +Which approach you take should depend on your use case. | 
|  | 127 | + | 
|  | 128 | +Null Treatment | 
|  | 129 | +^^^^^^^^^^^^^^ | 
|  | 130 | + | 
|  | 131 | +The ``null_treatment`` option allows you to either respect or ignore null values. | 
|  | 132 | + | 
|  | 133 | +One common use for handling nulls is when you want to find the first value within a | 
|  | 134 | +group. By setting the null treatment to ignore nulls, we can find the first non-null value | 
|  | 135 | +in each group. | 
|  | 136 | + | 
|  | 137 | + | 
|  | 138 | +.. ipython:: python | 
|  | 139 | +
 | 
|  | 140 | +    from datafusion.common import NullTreatment | 
|  | 141 | +
 | 
|  | 142 | +    df.aggregate([col_type_1], [ | 
|  | 143 | +        f.first_value( | 
|  | 144 | +            col_type_2, | 
|  | 145 | +            order_by=[col_attack], | 
|  | 146 | +            null_treatment=NullTreatment.RESPECT_NULLS | 
|  | 147 | +        ).alias("Lowest Attack Type 2")]) | 
|  | 148 | +
 | 
|  | 149 | +    df.aggregate([col_type_1], [ | 
|  | 150 | +        f.first_value( | 
|  | 151 | +            col_type_2, | 
|  | 152 | +            order_by=[col_attack], | 
|  | 153 | +            null_treatment=NullTreatment.IGNORE_NULLS | 
|  | 154 | +        ).alias("Lowest Attack Type 2")]) | 
|  | 155 | +
 | 
|  | 156 | +Filter | 
|  | 157 | +^^^^^^ | 
|  | 158 | + | 
|  | 159 | +The ``filter`` option restricts which rows are included in the aggregate function. As the example | 
|  | 160 | +above shows, this is useful when you want to limit the rows evaluated by the | 
|  | 161 | +aggregate function without filtering rows from the entire DataFrame. | 
|  | 162 | + | 
|  | 163 | +The ``filter`` argument takes a single boolean expression. | 
|  | 164 | + | 
|  | 165 | +Suppose we want to find the average speed of only those Pokemon that have low Attack values. | 
|  | 166 | + | 
|  | 167 | +.. ipython:: python | 
|  | 168 | +
 | 
|  | 169 | +    df.aggregate([col_type_1], [ | 
|  | 170 | +        f.avg(col_speed).alias("Avg Speed All"), | 
|  | 171 | +        f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")]) | 
|  | 172 | +
 | 
|  | 173 | +
 | 
|  | 174 | +Aggregate Functions | 
|  | 175 | +------------------- | 
|  | 176 | + | 
|  | 177 | +The available aggregate functions are listed below, followed by a short combined example: | 
|  | 178 | + | 
|  | 179 | +1. Comparison Functions | 
|  | 180 | +    - :py:func:`datafusion.functions.min` | 
|  | 181 | +    - :py:func:`datafusion.functions.max` | 
|  | 182 | +2. Math Functions | 
|  | 183 | +    - :py:func:`datafusion.functions.sum` | 
|  | 184 | +    - :py:func:`datafusion.functions.avg` | 
|  | 185 | +    - :py:func:`datafusion.functions.median` | 
|  | 186 | +3. Array Functions | 
|  | 187 | +    - :py:func:`datafusion.functions.array_agg` | 
|  | 188 | +4. Logical Functions | 
|  | 189 | +    - :py:func:`datafusion.functions.bit_and` | 
|  | 190 | +    - :py:func:`datafusion.functions.bit_or` | 
|  | 191 | +    - :py:func:`datafusion.functions.bit_xor` | 
|  | 192 | +    - :py:func:`datafusion.functions.bool_and` | 
|  | 193 | +    - :py:func:`datafusion.functions.bool_or` | 
|  | 194 | +5. Statistical Functions | 
|  | 195 | +    - :py:func:`datafusion.functions.count` | 
|  | 196 | +    - :py:func:`datafusion.functions.corr` | 
|  | 197 | +    - :py:func:`datafusion.functions.covar_samp` | 
|  | 198 | +    - :py:func:`datafusion.functions.covar_pop` | 
|  | 199 | +    - :py:func:`datafusion.functions.stddev` | 
|  | 200 | +    - :py:func:`datafusion.functions.stddev_pop` | 
|  | 201 | +    - :py:func:`datafusion.functions.var_samp` | 
|  | 202 | +    - :py:func:`datafusion.functions.var_pop` | 
|  | 203 | +6. Linear Regression Functions | 
|  | 204 | +    - :py:func:`datafusion.functions.regr_count` | 
|  | 205 | +    - :py:func:`datafusion.functions.regr_slope` | 
|  | 206 | +    - :py:func:`datafusion.functions.regr_intercept` | 
|  | 207 | +    - :py:func:`datafusion.functions.regr_r2` | 
|  | 208 | +    - :py:func:`datafusion.functions.regr_avgx` | 
|  | 209 | +    - :py:func:`datafusion.functions.regr_avgy` | 
|  | 210 | +    - :py:func:`datafusion.functions.regr_sxx` | 
|  | 211 | +    - :py:func:`datafusion.functions.regr_syy` | 
|  | 212 | +    - :py:func:`datafusion.functions.regr_sxy` | 
|  | 213 | +7. Positional Functions | 
|  | 214 | +    - :py:func:`datafusion.functions.first_value` | 
|  | 215 | +    - :py:func:`datafusion.functions.last_value` | 
|  | 216 | +    - :py:func:`datafusion.functions.nth_value` | 
|  | 217 | +8. String Functions | 
|  | 218 | +    - :py:func:`datafusion.functions.string_agg` | 
|  | 219 | +9. Approximation Functions | 
|  | 220 | +    - :py:func:`datafusion.functions.approx_distinct` | 
|  | 221 | +    - :py:func:`datafusion.functions.approx_median` | 
|  | 222 | +    - :py:func:`datafusion.functions.approx_percentile_cont` | 
|  | 223 | +    - :py:func:`datafusion.functions.approx_percentile_cont_with_weight` | 
|  | 224 | + | 
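|  |  | +As a closing sketch, several of these functions can be combined in a single | 
|  |  | +:py:func:`~datafusion.dataframe.DataFrame.aggregate` call. The column choices and aliases below | 
|  |  | +are only illustrative. | 
|  |  | +
|  |  | +.. ipython:: python | 
|  |  | +
|  |  | +    df.aggregate([col_type_1], [ | 
|  |  | +        f.count(col_speed).alias("Count"), | 
|  |  | +        f.median(col_speed).alias("Median Speed"), | 
|  |  | +        f.stddev(col_speed).alias("Speed Std Dev"), | 
|  |  | +        f.corr(col_speed, col_attack).alias("Speed Attack Corr")]) | 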