-
Notifications
You must be signed in to change notification settings - Fork 1.8k
feat: support month and year interval for date_bin on constant data #5982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| # Does not support months for Month-Day-Nano interval | ||
| statement error This feature is not implemented: DATE_BIN stride does not support month intervals | ||
| statement error DataFusion error: This feature is not implemented: DATE_BIN stride does not support combination of month, day and nanosecond intervals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new error message I have added in this PR
I suspect the code that is do this is the coercion to the type signature defined here: I think you could move the |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a good direction to me @NGA-TRAN -- thank you.
I left some comments that I think will help simplify the code
FYI @berkaysynnada @waitingkuo @mustafasrepo and @metesynnada who have been working in the interval area recently
…is not midnight of the first date of the month
NGA-TRAN
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb . I have address your comments. I also fixed a bug I just found. It was about origin the is not first moment of the month
It works like a charm. This PR now has dateline works for everything (if I do not miss other cases). |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NGA-TRAN -- I reviewed the PR carefully and it is looking good. I had some questions about the tests, some of which look incorrect -- I think we should resolve the question about incorrect results prior to merging as I think having incorrect results, even if just temporarily, is not a good thing
|
|
||
| # 2-month interval in date_bin with default start time | ||
| query P | ||
| select date_bin('2 month', column1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| 2022-03-01T00:00:00 | ||
|
|
||
| # month interval with start date is end of the month plus some minutes | ||
| # Note the datetime '2022-03-31 00:00:00'. Its bin is NOT '2022-03-31 00:15:00' which is after its time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| ---- | ||
| 2021-12-31T00:15:00 | ||
| 2021-12-31T00:15:00 | ||
| 2021-12-31T00:15:00 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why the timestamp
'2022-01-02 00:00:00'
would be binned into the prior year (2021)
2021-12-31
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return data is the start of the bin. The bin width is one year. The source data must be inside the bin. Since the origin is '1970-12-31T00:15:00Z', the start of the bins are '1970-12-31T00:15:00Z', '1971-12-31T00:15:00Z', ..., '2021-12-31T00:15:00Z', '2022-12-31T00:15:00Z', ...
'2022-01-02 00:00:00' is between '2021-12-31T00:15:00Z' and '2022-12-31T00:15:00Z' so it must be in the bin that starts with '2021-12-31T00:15:00Z'
| }; | ||
|
|
||
| let f = |x: Option<i64>| x.map(|x| date_bin_single(stride, x, origin)); | ||
| let f = |x: Option<i64>| x.map(|x| stride.bin(x, origin)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this is looking good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alamb. I have explained the test results and addressed your comments
| ---- | ||
| 2021-12-31T00:15:00 | ||
| 2021-12-31T00:15:00 | ||
| 2021-12-31T00:15:00 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return data is the start of the bin. The bin width is one year. The source data must be inside the bin. Since the origin is '1970-12-31T00:15:00Z', the start of the bins are '1970-12-31T00:15:00Z', '1971-12-31T00:15:00Z', ..., '2021-12-31T00:15:00Z', '2022-12-31T00:15:00Z', ...
'2022-01-02 00:00:00' is between '2021-12-31T00:15:00Z' and '2022-12-31T00:15:00Z' so it must be in the bin that starts with '2021-12-31T00:15:00Z'
| (timestamp '2022-01-02 00:00:00'), | ||
| (timestamp '2022-02-02 00:00:00'), | ||
| (timestamp '2022-02-15 00:00:00'), | ||
| (timestamp '2022-03-31 00:00:00') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
…move default origin from the tests
|
@alamb The PR is now ready for re-reviewing with all questions answered |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
@alamb thank you for pining me. @NGA-TRAN thank you for the great works. the codes looks good to me. i tested this in your branch, i wonder whether this corner cases make sense ❯ select date_bin('1 month', timestamp '2000-02-29T00:00:00', timestamp '2000-01-31T00:00:00');
+----------------------------------------------------------------------------------+
| datebin(Utf8("1 month"),Utf8("2000-02-29T00:00:00"),Utf8("2000-01-31T00:00:00")) |
+----------------------------------------------------------------------------------+
| 2000-02-29T00:00:00 |
+----------------------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.❯ select date_bin('1 month', timestamp '2000-01-31T00:00:00', timestamp '2000-02-29T00:00:00');
+----------------------------------------------------------------------------------+
| datebin(Utf8("1 month"),Utf8("2000-01-31T00:00:00"),Utf8("2000-02-29T00:00:00")) |
+----------------------------------------------------------------------------------+
| 2000-01-29T00:00:00 |
+----------------------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.i feel like the mechanism behind isn't consistent. i thought that the second's is |
| if bin_time > source_date { | ||
| let month_delta = month_delta - stride_months; | ||
| bin_time = if month_delta < 0 { | ||
| origin_date - Months::new(month_delta.unsigned_abs() as u32) | ||
| } else { | ||
| origin_date + Months::new(month_delta as u32) | ||
| }; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes 2000-01-31T00:00:00 becomes 2000-01-29T00:00:00 in
❯ select date_bin('1 month', timestamp '2000-01-31T00:00:00', timestamp '2000-02-29T00:00:00');
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #5982 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, @NGA-TRAN!
Not required for this PR, but perhaps something to consider in a followup is that I thought I would experiment with caching the stride function based on the interval input to the date_bin function:
I came up with the following change, that includes a benchmark of date_bin.
For the 1,000 element timestamp array the baseline (uncached):
month-day-nano/date_bin/1000
time: [3.7791 µs 3.7870 µs 3.7958 µs]
thrpt: [263.45 Melem/s 264.06 Melem/s 264.62 Melem/s]
and after the change, the throughput, which is the number of timestamps elements mapped by the function per second, improved consistently by about 28% or from 268 million to 338 million elements per second:
month-day-nano/date_bin/1000
time: [2.9499 µs 2.9529 µs 2.9563 µs]
thrpt: [338.26 Melem/s 338.65 Melem/s 339.00 Melem/s]
change:
time: [-22.338% -22.151% -21.964%] (p = 0.00 < 0.05)
thrpt: [+28.147% +28.454% +28.763%]
Performance has improved.
Subject: [PATCH] feat: cache stride function
---
Index: datafusion/physical-expr/Cargo.toml
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/datafusion/physical-expr/Cargo.toml b/datafusion/physical-expr/Cargo.toml
--- a/datafusion/physical-expr/Cargo.toml (revision 0dc43290387b7eadb24dcf7c9fcd3576ab3f2672)
+++ b/datafusion/physical-expr/Cargo.toml (date 1681677085430)
@@ -75,3 +75,7 @@
[[bench]]
harness = false
name = "in_list"
+
+[[bench]]
+harness = false
+name = "date_bin"
Index: datafusion/physical-expr/benches/date_bin.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/datafusion/physical-expr/benches/date_bin.rs b/datafusion/physical-expr/benches/date_bin.rs
new file mode 100644
--- /dev/null (date 1681679172908)
+++ b/datafusion/physical-expr/benches/date_bin.rs (date 1681679172908)
@@ -0,0 +1,38 @@
+use arrow_array::TimestampNanosecondArray;
+use criterion::{
+ black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput,
+};
+use datafusion_common::ScalarValue;
+use datafusion_expr::ColumnarValue;
+use datafusion_physical_expr::datetime_expressions::date_bin;
+use std::sync::Arc;
+
+fn bench_mdn(c: &mut Criterion) {
+ let elem_10 =
+ TimestampNanosecondArray::from_iter_values((0..(10 as i64)).into_iter());
+ let elem_1000 =
+ TimestampNanosecondArray::from_iter_values((0..(1000 as i64)).into_iter());
+
+ let mut group = c.benchmark_group("month-day-nano");
+ for elements in [Arc::new(elem_10), Arc::new(elem_1000)].iter() {
+ group.throughput(Throughput::Elements(elements.len() as u64));
+
+ let args = vec![
+ ColumnarValue::Scalar(ScalarValue::new_interval_mdn(0, 1, 0)),
+ ColumnarValue::Array(elements.clone()),
+ ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(Some(1), None)),
+ ];
+
+ group.bench_with_input(
+ BenchmarkId::new("date_bin", elements.len()),
+ &args,
+ |b, args| {
+ b.iter(|| date_bin(args));
+ },
+ );
+ }
+ group.finish();
+}
+
+criterion_group!(benches, bench_mdn);
+criterion_main!(benches);
Index: datafusion/physical-expr/src/datetime_expressions.rs
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/datafusion/physical-expr/src/datetime_expressions.rs b/datafusion/physical-expr/src/datetime_expressions.rs
--- a/datafusion/physical-expr/src/datetime_expressions.rs (revision 0dc43290387b7eadb24dcf7c9fcd3576ab3f2672)
+++ b/datafusion/physical-expr/src/datetime_expressions.rs (date 1681679592855)
@@ -428,6 +428,13 @@
Interval::Months(months) => date_bin_months_interval(*months, source, origin),
}
}
+
+ fn interval_fn(&self) -> (i64, fn(i64, i64, i64) -> i64) {
+ match self {
+ Interval::Nanoseconds(nanos) => (*nanos, date_bin_nanos_interval),
+ Interval::Months(months) => (*months, date_bin_months_interval),
+ }
+ }
}
// Supported intervals:
@@ -508,7 +515,8 @@
)),
};
- let f = |x: Option<i64>| x.map(|x| stride.bin(x, origin));
+ let (stride, stride_fn) = stride.interval_fn();
+ let f = |x: Option<i64>| x.map(|x| stride_fn(stride, x, origin));
Ok(match array {
ColumnarValue::Scalar(ScalarValue::TimestampNanosecond(v, tz_opt)) => {There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great test coverage and comments, thanks @NGA-TRAN 💯
Co-authored-by: Stuart Carnie <[email protected]>
|
You have great tests. I agree they look inconsistent. I think we may want to define the behavior here and see whether we want to keep it or not: I think this one is easy to understand: the origin is Jan 31 so any other month will always be the last day of that month (28, 29, 30, 31 depending on the month and year) ❯ select date_bin('1 month', timestamp '2000-02-29T00:00:00', timestamp '2000-01-31T00:00:00');
+----------------------------------------------------------------------------------+
| datebin(Utf8("1 month"),Utf8("2000-02-29T00:00:00"),Utf8("2000-01-31T00:00:00")) |
+----------------------------------------------------------------------------------+
| 2000-02-29T00:00:00 |
+----------------------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.This is is tricky and we need to identify the behavior. ❯ select date_bin('1 month', timestamp '2000-01-31T00:00:00', timestamp '2000-02-29T00:00:00');
+----------------------------------------------------------------------------------+
| datebin(Utf8("1 month"),Utf8("2000-01-31T00:00:00"),Utf8("2000-02-29T00:00:00")) |
+----------------------------------------------------------------------------------+
| 2000-01-29T00:00:00 |
+----------------------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.For February, 29 (or 28) is the last day of the month, but for other month is it not the last day of the month. So chromo's Months that we are using in this PR do not consider 29 as the last day of the month and just simple add/minus 30 days which I think makes sense. ❯ select date_bin('1 month', timestamp '2000-01-31T00:00:00', timestamp '2000-03-29T00:00:00');
+----------------------------------------------------------------------------------+
| datebin(Utf8("1 month"),Utf8("2000-01-31T00:00:00"),Utf8("2000-03-29T00:00:00")) |
+----------------------------------------------------------------------------------+
| 2000-01-29T00:00:00 |
+----------------------------------------------------------------------------------+
1 row in set. Query took 0.005 seconds.So I think this is the behavior is acceptable. What do you, @alamb and @stuartcarnie think? |
|
I am adding @stuartcarnie 's suggestion and benchmark into this PR |
|
@alamb : I have added @stuartcarnie's the suggestion without the benchmark because I got Audit error if the benchmark is included. I am not sure what is the policy for adding benchmark |
|
That's great, thanks @NGA-TRAN! |
|
I have added the tests with origin on 29th of Feb and Mar and interval is a month. I have added explanation for the tests on why the bins are always on the 29th |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks everyone for their comments and help here
Which issue does this PR close?
Closes #5689
Rationale for this change
In IOx, we see users use
data_binwith month interval to group data by month for their monthly report. We think it is a very useful feature to support.What changes are included in this PR?
Since each month does not include the same number of nanoseconds and for date_bin, it also depends on the
origininput. This PR mainly add a function to handle month interval for date bin and some related refactoring workAre these changes tested?
Yes
Are there any user-facing changes?
Yes, date_bin with month and year interval is now working on constant input