Support `=`, `<`, `<=`, `>`, `>=`, `!=`, `is distinct from`, `is not distinct from` for `BooleanArray` #1163

alamb · 2021-10-21T18:17:39Z

Which issue does this PR close?

Resolves #1159

PR is mostly tests

Sorry for what looks like a large PR :( I blame the tests

Rationale for this change

This is mostly interesting because during constant folding / simplification we can now simplify degenerate expressions such as true = false
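To make the rationale concrete, here is a minimal sketch of folding a comparison between two boolean literals. The `Expr` enum and the `simplify` function are hypothetical stand-ins invented for illustration; they are not DataFusion's actual types or its simplifier.

```rust
/// Hypothetical, heavily simplified expression tree (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A boolean literal; `None` models SQL NULL.
    Literal(Option<bool>),
    /// A reference to a named column, which cannot be folded.
    Column(String),
    /// `left = right`
    Eq(Box<Expr>, Box<Expr>),
}

/// Folds `literal = literal` into a single literal, working bottom-up.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Eq(left, right) => match (simplify(*left), simplify(*right)) {
            (Expr::Literal(l), Expr::Literal(r)) => {
                // SQL semantics: a comparison with NULL yields NULL.
                Expr::Literal(l.zip(r).map(|(a, b)| a == b))
            }
            // At least one side is not a literal: rebuild the comparison.
            (l, r) => Expr::Eq(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

With this sketch, `true = false` folds to the literal `false`, while `c1 = true` is left unchanged because the column value is only known at execution time.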

Now that @jimexist added apache/arrow-rs#860 and @Dandandan added apache/arrow-rs#844 in arrow, this PR hooks that up

As a nice side effect, parquet row group pruning is now supported for boolean columns as well 🎉

What changes are included in this PR?

Update to arrow 6.2.0
Support =, <, <=, >, >=, !=, is distinct from, is not distinct from for BooleanArray (aka for boolean columns)
Simple implementations of *_bool_scalar
Many tests
Update the pruning tests to reflect the fact that boolean pruning now happens
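For readers unfamiliar with the semantics of the operators listed above, here is a small sketch using `Option<bool>` as a stand-in for a nullable boolean slot. The function names (`sql_eq`, `sql_lt`, `compare`, and so on) are made up for this example and are not the arrow kernel names; the sketch assumes the usual SQL rules, where ordinary comparisons propagate nulls, `false` sorts before `true`, and the `distinct from` variants treat null as an ordinary value.

```rust
/// One slot of a boolean column; `None` models a null.
type Slot = Option<bool>;

/// `l = r`: null if either side is null.
fn sql_eq(l: Slot, r: Slot) -> Slot {
    l.zip(r).map(|(a, b)| a == b)
}

/// `l < r`: relies on Rust's ordering `false < true`; null if either side is null.
fn sql_lt(l: Slot, r: Slot) -> Slot {
    l.zip(r).map(|(a, b)| a < b)
}

/// `l IS DISTINCT FROM r`: never null, and two nulls are not distinct.
fn is_distinct_from(l: Slot, r: Slot) -> bool {
    l != r
}

/// `l IS NOT DISTINCT FROM r`: never null, and two nulls compare as the same.
fn is_not_distinct_from(l: Slot, r: Slot) -> bool {
    l == r
}

/// Applies a null-propagating comparison element-wise over two columns.
fn compare(left: &[Slot], right: &[Slot], op: fn(Slot, Slot) -> Slot) -> Vec<Slot> {
    assert_eq!(left.len(), right.len(), "columns must have equal length");
    left.iter().zip(right).map(|(l, r)| op(*l, *r)).collect()
}
```

The remaining operators (`<=`, `>`, `>=`, `!=`) follow the same shape as `sql_lt`, with only the inner comparison changed.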

Are there any user-facing changes?

Fewer errors

alamb · 2021-10-21T19:27:42Z

datafusion/src/physical_plan/expressions/binary.rs

    Ok(Arc::new(BinaryExpr::new(l, op, r)))
 }

+// TODO file a ticket with arrow-rs to include these kernels


Suggested change

// TODO file a ticket with arrow-rs to include these kernels

// When arrow-rs has these kernels, can remove this implementation

// see https://github.com/apache/arrow-rs/issues/842

Filed ticket in arrow apache/arrow-rs#842

FWIW @Dandandan has added these kernels upstream to Arrow so we can use 6.1.0 when that comes out (in a week or so): apache/arrow-rs#844

@jimexist has actually implemented operations like bool_lt etc in apache/arrow-rs#860 so when that is available in datafusion (next week) I will update this PR to include those operations as well

alamb · 2021-11-02T20:03:59Z

This one is waiting on arrow-rs 6.1 to be released, and then I should be able to clean it up and get it ready for a proper review

alamb · 2021-11-05T16:56:14Z

Turns out that we forgot to make the required functions public 🤦 . Will wait for arrow 6.2 to include apache/arrow-rs#913

alamb

FYI @Dandandan @jimexist

alamb · 2021-11-15T21:51:01Z

datafusion/src/physical_optimizer/pruning.rs

-            result
-        )
+        let result = p.prune(&statistics).unwrap();
+        assert_eq!(result, expected_true);


pruning works for boolean columns now
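As a rough illustration of why min/max statistics can prune boolean columns once comparisons are supported: for a predicate like `c2 = true`, a row group whose maximum is `false` cannot contain a match. The `BoolStats` type and the `might_match_eq` and `prune` functions below are hypothetical and only sketch the idea; they are not the `PruningPredicate` API.

```rust
/// Min/max statistics for one row group of a boolean column (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct BoolStats {
    min: Option<bool>,
    max: Option<bool>,
}

/// Returns `false` only when the statistics prove that no row can satisfy
/// `col = literal`; otherwise the row group must be kept.
fn might_match_eq(stats: BoolStats, literal: bool) -> bool {
    match (stats.min, stats.max) {
        // `col = literal` is possible only if min <= literal <= max,
        // using the ordering false < true.
        (Some(min), Some(max)) => min <= literal && literal <= max,
        // Missing statistics: nothing can be proven, so keep the row group.
        _ => true,
    }
}

/// One entry per row group: `true` means "must read", `false` means "can skip".
fn prune(row_groups: &[BoolStats], literal: bool) -> Vec<bool> {
    row_groups
        .iter()
        .map(|stats| might_match_eq(*stats, literal))
        .collect()
}
```

In this sketch, a row group with `min = false, max = false` is skipped for `col = true`, while a row group with unknown statistics is always kept.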

datafusion/src/physical_plan/expressions/binary.rs

alamb · 2021-11-15T21:53:27Z

datafusion/src/physical_plan/expressions/binary.rs

            DataType::Date64 => {
                compute_op_scalar!($LEFT, $RIGHT, $OP, Date64Array)
            }
+            DataType::Boolean => compute_bool_op_scalar!($LEFT, $RIGHT, $OP, BooleanArray),


adding this line and the one below it adds all the new support, which is kind of cool! It is terrifying how many functions end up being called :)
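A simplified sketch of the dispatch idea, for anyone who has not read the macros in binary.rs: a macro matches on the data type and forwards to a typed kernel, so supporting a new type is mostly a matter of adding one match arm. Everything here (`Column`, `Scalar`, `lt_scalar`, `dispatch_scalar_op!`) is hypothetical and much smaller than the real `compute_op_scalar!` machinery.

```rust
/// Hypothetical column representation standing in for arrow arrays.
#[derive(Debug, PartialEq)]
enum Column {
    Int64(Vec<Option<i64>>),
    Boolean(Vec<Option<bool>>),
}

/// Hypothetical scalar value to compare each element against.
#[derive(Debug, PartialEq)]
enum Scalar {
    Int64(i64),
    Boolean(bool),
}

/// Generic "kernel": element-wise `value < scalar`, propagating nulls.
fn lt_scalar<T: PartialOrd + Copy>(values: &[Option<T>], scalar: T) -> Vec<Option<bool>> {
    values.iter().map(|v| v.map(|x| x < scalar)).collect()
}

/// Dispatches on the column type and calls the kernel named by `$OP`.
macro_rules! dispatch_scalar_op {
    ($COLUMN:expr, $SCALAR:expr, $OP:ident) => {
        match ($COLUMN, $SCALAR) {
            (Column::Int64(values), Scalar::Int64(s)) => Ok($OP(values.as_slice(), *s)),
            // The "one new line": booleans reuse the same generic kernel.
            (Column::Boolean(values), Scalar::Boolean(s)) => Ok($OP(values.as_slice(), *s)),
            _ => Err(String::from("mismatched column and scalar types")),
        }
    };
}

/// Evaluates `column < scalar` for any supported type.
fn lt_dispatch(column: &Column, scalar: &Scalar) -> Result<Vec<Option<bool>>, String> {
    dispatch_scalar_op!(column, scalar, lt_scalar)
}
```

The design choice this illustrates is that the per-type work lives in the kernels, and the macro only selects which one to call.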

alamb · 2021-11-15T21:53:56Z

datafusion/src/physical_plan/file_format/parquet.rs

-        // where a null array is generated for some statistics columns
-        // int > 1 and bool = true => c1_max > 1 and null
-        let expr = col("c1").gt(lit(15)).and(col("c2").eq(lit(true)));
+        // test row group predicate with an unknown (Null) expr


now bool stats don't result in null columns, so I needed to use a constant to get the same effect

alamb · 2021-11-18T22:15:06Z

This PR is ready for review / analysis if/when you get a chance @jimexist / @Dandandan / @houqp / @rdettai. It looks much bigger than it is because of the tests. It is mostly about hooking up some more arrow compute kernels

There are many PRs flying in DataFusion recently 😅 fun times

rdettai

Great addition @alamb ! thanks !

datafusion/src/physical_plan/expressions/binary.rs

Co-authored-by: rdettai <[email protected]>

alamb

Thanks for the review @rdettai -- I'll plan to merge this one in tomorrow and file arrow-rs tickets if there are no other comments.

jimexist · 2021-11-20T10:26:13Z

datafusion/src/physical_plan/expressions/binary.rs

+            .expect("compute_op failed to downcast array");
+        // generate the scalar function name, such as lt_scalar, from the $OP parameter
+        // (which could have a value of lt) and the suffix _scalar
+        Ok(Arc::new(paste::expr! {[<$OP _bool_scalar>]}(


For the record this pattern is used elsewhere in this file, I was just following it :)

datafusion/src/physical_plan/expressions/binary.rs

Co-authored-by: Jiayu Liu <[email protected]>


github-actions bot added the datafusion label Oct 21, 2021

This was referenced Oct 21, 2021

(WIP) Cleanup: remove redundant constant evaluation in Simplifier #1164

Closed

Add boolean equality and inequality kernels apache/arrow-rs#842

Closed

add compute kernels that operate on ArrayRef -- e.g eq_dyn apache/arrow-rs#843

Closed

alamb commented Oct 21, 2021

alamb marked this pull request as draft October 26, 2021 10:51

alamb changed the title ~~Support <bool col> = <bool col> and <bool col> != <bool col>~~ (WIP) Support <bool col> = <bool col> and <bool col> != <bool col> Oct 26, 2021

alamb force-pushed the alamb/bool_expr branch from fda35ac to 3cb11a4 Compare November 4, 2021 20:13

alamb mentioned this pull request Nov 5, 2021

fix: not do boolean folding on NULL and/or expr #1245

Merged

alamb force-pushed the alamb/bool_expr branch from 3cb11a4 to 5d09920 Compare November 15, 2021 19:18

alamb changed the title ~~(WIP) Support <bool col> = <bool col> and <bool col> != <bool col>~~ Support <bool col> = <bool col> and <bool col> != <bool col> Nov 15, 2021

alamb marked this pull request as ready for review November 15, 2021 19:19

alamb changed the title ~~Support <bool col> = <bool col> and <bool col> != <bool col>~~ Support =, <, <=, >, >=, != for BooleanArray Nov 15, 2021

alamb changed the title ~~Support =, <, <=, >, >=, != for BooleanArray~~ Support =, <, <=, >, >=, !=, is distinct from, is not distinct from for BooleanArray Nov 15, 2021

Support =, <, <=, >, >=, !=, is distinct from, `is not distinct from` for `BooleanArray`

c990851

alamb force-pushed the alamb/bool_expr branch from 5d09920 to c990851 Compare November 15, 2021 21:50

alamb commented Nov 15, 2021

alamb requested review from Dandandan and jimexist November 15, 2021 22:06

rdettai reviewed Nov 19, 2021

datafusion/src/physical_plan/expressions/binary.rs (outdated)

datafusion/src/physical_plan/expressions/binary.rs (outdated)

Update datafusion/src/physical_plan/expressions/binary.rs

c847623

Co-authored-by: rdettai <[email protected]>

alamb commented Nov 19, 2021

Merge remote-tracking branch 'apache/master' into alamb/bool_expr

d2aafc7

jimexist reviewed Nov 20, 2021

datafusion/src/physical_plan/expressions/binary.rs (outdated)

jimexist reviewed Nov 20, 2021

datafusion/src/physical_plan/expressions/binary.rs (outdated)

jimexist reviewed Nov 20, 2021

datafusion/src/physical_plan/expressions/binary.rs (outdated)

jimexist approved these changes Nov 20, 2021

This was referenced Nov 20, 2021

Add boolean comparison to scalar kernels for less then, greater than apache/arrow-rs#959

Closed

Consider adding is_distinct_from kernels apache/arrow-rs#960

Closed

alamb and others added 2 commits November 20, 2021 06:59

Apply suggestions from code review

2740e7b

Co-authored-by: Jiayu Liu <[email protected]>

Merge remote-tracking branch 'apache/master' into alamb/bool_expr

aa8b39c

alamb merged commit 00850a4 into apache:master Nov 20, 2021

alamb deleted the alamb/bool_expr branch November 20, 2021 12:59

liukun4515 mentioned this pull request Dec 25, 2021

Implement DECIMAL type #122

Closed

alamb added the enhancement New feature or request label Feb 10, 2022
