You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Aggregate improvements and SQL compatibility (#134)
* A lot of refactoring the the groupby. Mainly to include both distinct and null-grouping
* Test for non-dask aggregations
* All NaN data needs to go into the same partition (otherwise we can not sort)
* Fix compatibility with SQL on null-joins
* Distinct is not needed, as it is optimized away from Calcite
* Implement is not distinct
* Describe new limitations and remove old ones
* Added compatibility test from fugue
* Added a test for sorting with multiple partitions and NaNs
* Stylefix
Copy file name to clipboardExpand all lines: docs/pages/sql.rst
+6-3Lines changed: 6 additions & 3 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -199,14 +199,16 @@ Limitatons
199
199
200
200
``dask-sql`` is still in early development, therefore exist some limitations:
201
201
202
-
* Not all operations and aggregations are implemented already, most prominently: ``WINDOW`` is not implemented so far.
203
-
* ``GROUP BY`` aggregations can not use ``DISTINCT``
202
+
Not all operations and aggregations are implemented already, most prominently: ``WINDOW`` is not implemented so far.
204
203
205
204
.. note::
206
205
207
206
Whenever you find a not already implemented operation, keyword
208
207
or functionality, please raise an issue at our `issue tracker <https://github.com/nils-braun/dask-sql/issues>`_ with your use-case.
209
208
209
+
Dask/pandas and SQL treat null-values (or nan) differently on sorting, grouping and joining.
210
+
``dask-sql`` tries to follow the SQL standard as much as possible, so results might be different to what you expect from Dask/pandas.
211
+
210
212
Apart from those functional limitations, there is a operation which need special care: ``ORDER BY```.
211
213
Normally, ``dask-sql`` calls create a ``dask`` data frame, which gets only computed when you call the ``.compute()`` member.
212
214
Due to internal constraints, this is currently not the case for ``ORDER BY``.
@@ -218,4 +220,5 @@ Including this operation will trigger a calculation of the full data frame alrea
218
220
The data inside ``dask`` is partitioned, to distribute it over the cluster.
219
221
``head`` will only return the first N elements from the first partition - even if N is larger than the partition size.
220
222
As a benefit, calling ``.head(N)`` is typically faster than calculating the full data sample with ``.compute()``.
221
-
``LIMIT`` on the other hand will always return the first N elements - no matter on how many partitions they are scattered - but will also need to precalculate the first partition to find out, if it needs to have a look into all data or not.
223
+
``LIMIT`` on the other hand will always return the first N elements - no matter on how many partitions they are scattered -
224
+
but will also need to precalculate the first partition to find out, if it needs to have a look into all data or not.
0 commit comments