Skip to content

Conversation

@NGA-TRAN
Copy link
Contributor

@NGA-TRAN NGA-TRAN commented Apr 12, 2023

Which issue does this PR close?

Closes #5689

Rationale for this change

In IOx, we see users use data_bin with month interval to group data by month for their monthly report. We think it is a very useful feature to support.

What changes are included in this PR?

Since each month does not include the same number of nanoseconds and for date_bin, it also depends on the origin input. This PR mainly add a function to handle month interval for date bin and some related refactoring work

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, date_bin with month and year interval is now working on constant input

@github-actions github-actions bot added core Core DataFusion crate physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) labels Apr 12, 2023

# Does not support months for Month-Day-Nano interval
statement error This feature is not implemented: DATE_BIN stride does not support month intervals
statement error DataFusion error: This feature is not implemented: DATE_BIN stride does not support combination of month, day and nanosecond intervals
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new error message I have added in this PR

@NGA-TRAN
Copy link
Contributor Author

NGA-TRAN commented Apr 12, 2023

@alamb, @tustvold, @wolffcm

@alamb
Copy link
Contributor

alamb commented Apr 13, 2023

The month interval for array input was converted into ScalarValue::IntervalDayTime (days) somewhere. My next PR is to find that place and convert month interval into ScalarValue::IntervalMonthDayNano instead. I think if I do that, they will automatically work with this code

I suspect the code that is do this is the coercion to the type signature defined here:

https://github.com/apache/arrow-datafusion/blob/1646bf6aae1e568e38c5a43e9a44c34e26222341/datafusion/expr/src/function.rs#L474-L494

I think you could move the Interval(MonthDayNano) signatures above the Interval(DayTime) ones

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good direction to me @NGA-TRAN -- thank you.

I left some comments that I think will help simplify the code

FYI @berkaysynnada @waitingkuo @mustafasrepo and @metesynnada who have been working in the interval area recently

…is not midnight of the first date of the month
Copy link
Contributor Author

@NGA-TRAN NGA-TRAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb . I have address your comments. I also fixed a bug I just found. It was about origin the is not first moment of the month

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Apr 14, 2023
@NGA-TRAN
Copy link
Contributor Author

@alamb

I think you could move the Interval(MonthDayNano) signatures above the Interval(DayTime) ones

It works like a charm. This PR now has dateline works for everything (if I do not miss other cases).
To avoid many duplicate tests with and without INTERVAL keyword, I added INTERVAL into tests without it before and remove it for the ones with it. These make the tests more concise and I know all of them work correctly

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NGA-TRAN -- I reviewed the PR carefully and it is looking good. I had some questions about the tests, some of which look incorrect -- I think we should resolve the question about incorrect results prior to merging as I think having incorrect results, even if just temporarily, is not a good thing


# 2-month interval in date_bin with default start time
query P
select date_bin('2 month', column1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

2022-03-01T00:00:00

# month interval with start date is end of the month plus some minutes
# Note the datetime '2022-03-31 00:00:00'. Its bin is NOT '2022-03-31 00:15:00' which is after its time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

----
2021-12-31T00:15:00
2021-12-31T00:15:00
2021-12-31T00:15:00
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why the timestamp

'2022-01-02 00:00:00'

would be binned into the prior year (2021)

2021-12-31

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return data is the start of the bin. The bin width is one year. The source data must be inside the bin. Since the origin is '1970-12-31T00:15:00Z', the start of the bins are '1970-12-31T00:15:00Z', '1971-12-31T00:15:00Z', ..., '2021-12-31T00:15:00Z', '2022-12-31T00:15:00Z', ...

'2022-01-02 00:00:00' is between '2021-12-31T00:15:00Z' and '2022-12-31T00:15:00Z' so it must be in the bin that starts with '2021-12-31T00:15:00Z'

};

let f = |x: Option<i64>| x.map(|x| date_bin_single(stride, x, origin));
let f = |x: Option<i64>| x.map(|x| stride.bin(x, origin));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is looking good

Copy link
Contributor Author

@NGA-TRAN NGA-TRAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb. I have explained the test results and addressed your comments

----
2021-12-31T00:15:00
2021-12-31T00:15:00
2021-12-31T00:15:00
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return data is the start of the bin. The bin width is one year. The source data must be inside the bin. Since the origin is '1970-12-31T00:15:00Z', the start of the bins are '1970-12-31T00:15:00Z', '1971-12-31T00:15:00Z', ..., '2021-12-31T00:15:00Z', '2022-12-31T00:15:00Z', ...

'2022-01-02 00:00:00' is between '2021-12-31T00:15:00Z' and '2022-12-31T00:15:00Z' so it must be in the bin that starts with '2021-12-31T00:15:00Z'

(timestamp '2022-01-02 00:00:00'),
(timestamp '2022-02-02 00:00:00'),
(timestamp '2022-02-15 00:00:00'),
(timestamp '2022-03-31 00:00:00')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@NGA-TRAN
Copy link
Contributor Author

@alamb The PR is now ready for re-reviewing with all questions answered

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thank you @NGA-TRAN

I took the liberty of merging this PR from main and adding a few test cases / comment clarifications in 0dc4329

Hopefully this is ok and I plan to merge this PR once the tests pass

cc @waitingkuo @stuartcarnie

@waitingkuo
Copy link
Contributor

@alamb thank you for pining me. @NGA-TRAN thank you for the great works. the codes looks good to me.

i tested this in your branch, i wonder whether this corner cases make sense

select date_bin('1 month', timestamp '2000-02-29T00:00:00', timestamp '2000-01-31T00:00:00');
+----------------------------------------------------------------------------------+
| datebin(Utf8("1 month"),Utf8("2000-02-29T00:00:00"),Utf8("2000-01-31T00:00:00")) |
+----------------------------------------------------------------------------------+
| 2000-02-29T00:00:00                                                              |
+----------------------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.
select date_bin('1 month', timestamp '2000-01-31T00:00:00', timestamp '2000-02-29T00:00:00');
+----------------------------------------------------------------------------------+
| datebin(Utf8("1 month"),Utf8("2000-01-31T00:00:00"),Utf8("2000-02-29T00:00:00")) |
+----------------------------------------------------------------------------------+
| 2000-01-29T00:00:00                                                              |
+----------------------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.

i feel like the mechanism behind isn't consistent. i thought that the second's is 2000-01-31T00:00:00

Comment on lines +380 to +387
if bin_time > source_date {
let month_delta = month_delta - stride_months;
bin_time = if month_delta < 0 {
origin_date - Months::new(month_delta.unsigned_abs() as u32)
} else {
origin_date + Months::new(month_delta as u32)
};
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes 2000-01-31T00:00:00 becomes 2000-01-29T00:00:00 in
❯ select date_bin('1 month', timestamp '2000-01-31T00:00:00', timestamp '2000-02-29T00:00:00');

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@stuartcarnie stuartcarnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, @NGA-TRAN!

Not required for this PR, but perhaps something to consider in a followup is that I thought I would experiment with caching the stride function based on the interval input to the date_bin function:

https://github.com/NGA-TRAN/arrow-datafusion/blob/0dc43290387b7eadb24dcf7c9fcd3576ab3f2672/datafusion/physical-expr/src/datetime_expressions.rs#L511

I came up with the following change, that includes a benchmark of date_bin.

For the 1,000 element timestamp array the baseline (uncached):

month-day-nano/date_bin/1000
                        time:   [3.7791 µs 3.7870 µs 3.7958 µs]
                        thrpt:  [263.45 Melem/s 264.06 Melem/s 264.62 Melem/s]

and after the change, the throughput, which is the number of timestamps elements mapped by the function per second, improved consistently by about 28% or from 268 million to 338 million elements per second:

month-day-nano/date_bin/1000
                        time:   [2.9499 µs 2.9529 µs 2.9563 µs]
                        thrpt:  [338.26 Melem/s 338.65 Melem/s 339.00 Melem/s]
                 change:
                        time:   [-22.338% -22.151% -21.964%] (p = 0.00 < 0.05)
                        thrpt:  [+28.147% +28.454% +28.763%]
                        Performance has improved.
Subject: [PATCH] feat: cache stride function
---
Index: datafusion/physical-expr/Cargo.toml
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/datafusion/physical-expr/Cargo.toml b/datafusion/physical-expr/Cargo.toml
--- a/datafusion/physical-expr/Cargo.toml	(revision 0dc43290387b7eadb24dcf7c9fcd3576ab3f2672)
+++ b/datafusion/physical-expr/Cargo.toml	(date 1681677085430)
@@ -75,3 +75,7 @@
 [[bench]]
 harness = false
 name = "in_list"
+
+[[bench]]
+harness = false
+name = "date_bin"
Index: datafusion/physical-expr/benches/date_bin.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/datafusion/physical-expr/benches/date_bin.rs b/datafusion/physical-expr/benches/date_bin.rs
new file mode 100644
--- /dev/null	(date 1681679172908)
+++ b/datafusion/physical-expr/benches/date_bin.rs	(date 1681679172908)
@@ -0,0 +1,38 @@
+use arrow_array::TimestampNanosecondArray;
+use criterion::{
+    black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput,
+};
+use datafusion_common::ScalarValue;
+use datafusion_expr::ColumnarValue;
+use datafusion_physical_expr::datetime_expressions::date_bin;
+use std::sync::Arc;
+
+fn bench_mdn(c: &mut Criterion) {
+    let elem_10 =
+        TimestampNanosecondArray::from_iter_values((0..(10 as i64)).into_iter());
+    let elem_1000 =
+        TimestampNanosecondArray::from_iter_values((0..(1000 as i64)).into_iter());
+
+    let mut group = c.benchmark_group("month-day-nano");
+    for elements in [Arc::new(elem_10), Arc::new(elem_1000)].iter() {
+        group.throughput(Throughput::Elements(elements.len() as u64));
+
+        let args = vec![
+            ColumnarValue::Scalar(ScalarValue::new_interval_mdn(0, 1, 0)),
+            ColumnarValue::Array(elements.clone()),
+            ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(Some(1), None)),
+        ];
+
+        group.bench_with_input(
+            BenchmarkId::new("date_bin", elements.len()),
+            &args,
+            |b, args| {
+                b.iter(|| date_bin(args));
+            },
+        );
+    }
+    group.finish();
+}
+
+criterion_group!(benches, bench_mdn);
+criterion_main!(benches);
Index: datafusion/physical-expr/src/datetime_expressions.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/datafusion/physical-expr/src/datetime_expressions.rs b/datafusion/physical-expr/src/datetime_expressions.rs
--- a/datafusion/physical-expr/src/datetime_expressions.rs	(revision 0dc43290387b7eadb24dcf7c9fcd3576ab3f2672)
+++ b/datafusion/physical-expr/src/datetime_expressions.rs	(date 1681679592855)
@@ -428,6 +428,13 @@
             Interval::Months(months) => date_bin_months_interval(*months, source, origin),
         }
     }
+
+    fn interval_fn(&self) -> (i64, fn(i64, i64, i64) -> i64) {
+        match self {
+            Interval::Nanoseconds(nanos) => (*nanos, date_bin_nanos_interval),
+            Interval::Months(months) => (*months, date_bin_months_interval),
+        }
+    }
 }
 
 // Supported intervals:
@@ -508,7 +515,8 @@
         )),
     };
 
-    let f = |x: Option<i64>| x.map(|x| stride.bin(x, origin));
+    let (stride, stride_fn) = stride.interval_fn();
+    let f = |x: Option<i64>| x.map(|x| stride_fn(stride, x, origin));
 
     Ok(match array {
         ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(v, tz_opt)) => {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great test coverage and comments, thanks @NGA-TRAN 💯

@NGA-TRAN
Copy link
Contributor Author

@waitingkuo

You have great tests. I agree they look inconsistent. I think we may want to define the behavior here and see whether we want to keep it or not:

I think this one is easy to understand: the origin is Jan 31 so any other month will always be the last day of that month (28, 29, 30, 31 depending on the month and year)

select date_bin('1 month', timestamp '2000-02-29T00:00:00', timestamp '2000-01-31T00:00:00');
+----------------------------------------------------------------------------------+
| datebin(Utf8("1 month"),Utf8("2000-02-29T00:00:00"),Utf8("2000-01-31T00:00:00")) |
+----------------------------------------------------------------------------------+
| 2000-02-29T00:00:00                                                              |
+----------------------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.

This is is tricky and we need to identify the behavior.

select date_bin('1 month', timestamp '2000-01-31T00:00:00', timestamp '2000-02-29T00:00:00');
+----------------------------------------------------------------------------------+
| datebin(Utf8("1 month"),Utf8("2000-01-31T00:00:00"),Utf8("2000-02-29T00:00:00")) |
+----------------------------------------------------------------------------------+
| 2000-01-29T00:00:00                                                              |
+----------------------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.

For February, 29 (or 28) is the last day of the month, but for other month is it not the last day of the month. So chromo's Months that we are using in this PR do not consider 29 as the last day of the month and just simple add/minus 30 days which I think makes sense.

select date_bin('1 month', timestamp '2000-01-31T00:00:00', timestamp '2000-03-29T00:00:00');
+----------------------------------------------------------------------------------+
| datebin(Utf8("1 month"),Utf8("2000-01-31T00:00:00"),Utf8("2000-03-29T00:00:00")) |
+----------------------------------------------------------------------------------+
| 2000-01-29T00:00:00                                                              |
+----------------------------------------------------------------------------------+
1 row in set. Query took 0.005 seconds.

So I think this is the behavior is acceptable. What do you, @alamb and @stuartcarnie think?

@NGA-TRAN
Copy link
Contributor Author

I am adding @stuartcarnie 's suggestion and benchmark into this PR

@NGA-TRAN
Copy link
Contributor Author

@alamb : I have added @stuartcarnie's the suggestion without the benchmark because I got Audit error if the benchmark is included. I am not sure what is the policy for adding benchmark

@stuartcarnie
Copy link
Contributor

That's great, thanks @NGA-TRAN! date_bin is used extensively in IOx, so downstream will definitely appreciate it!

@NGA-TRAN
Copy link
Contributor Author

I have added the tests with origin on 29th of Feb and Mar and interval is a month. I have added explanation for the tests on why the bins are always on the 29th

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks everyone for their comments and help here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

date_bin doesn't work with months or years

4 participants