-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix interval to use consistent units and arrow parser
#5806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| [dependencies] | ||
| arrow-schema = { workspace = true } | ||
| arrow = { workspace = true } | ||
| datafusion-common = { path = "../common", version = "21.0.0" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datafusion-common already depends on arrow so this is no new (net) dependency
datafusion/sql/src/expr/value.rs
Outdated
| format!("{value} {leading_field}") | ||
| } else { | ||
| // default to seconds for the units | ||
| format!("{value} seconds") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the same default as previously
|
|
||
| /// Cast `expr` to the specified type, if possible | ||
| fn cast_expr(expr: &Expr, to_type: &DataType, schema: &DFSchema) -> Result<Expr> { | ||
| // Special case until Interval coercion is handled in arrow-rs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 for avoiding the special cases now that arrow-rs has string --> Interval suppot
83c6697 to
4eba5c2
Compare
| !matches!(self, &Self::UNCOMPRESSED) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of this PR is to remove this code (which has been upstreamed to arrow-rs)
| test_expression!( | ||
| "interval '13 month'", | ||
| "1 years 1 mons 0 days 0 hours 0 mins 0.00 secs" | ||
| "0 years 13 mons 0 days 0 hours 0 mins 0.000000000 secs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this matters -- but interval parsing now is month/day/nano always and thus is not quite as nice a display as it was previously (no years, instead it is always months). However perhaps if anyone cares we can improve the display in arrow-rs for this kind of interval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@waitingkuo and @ovr -- Do you have any thoughts about this proposed change (that the reported intervals are not in the same units as the original expression)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this was a conscious change when the parsing logic was ported, to faithfully preserve what the user specified. This was mainly in the context of rounding days to months and hours to days, which were simply incorrect, but months to years got the same treatment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this was a conscious change when the parsing logic was ported, to faithfully preserve what the user specified. This was mainly in the context of rounding days to months and hours to days, which were simply incorrect, but months to years got the same treatment
Make sense to me👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exploring how postgres handles this, if you do EXTRACT(EPOCH FROM interval), it treats months as 30 days, until you get to 12 months, and then it is 365.25 days. Not sure how much it matters to match postgres, but that's what I've seen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤯
| "| 2020-09-08T11:42:29.190855 | 2020-09-08T11:42:30.190855 |", | ||
| "+----------------------------+--------------------------------------+", | ||
| "+----------------------------+-------------------------------------------------+", | ||
| "| ts | table_b.ts + IntervalMonthDayNano(\"1000000000\") |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty terrible formatting for the column name 🤮 but I don't think it is fixable easily -- see #2027 for more discussion
| .as_ref() | ||
| .map(|dt| dt.to_string()) | ||
| .unwrap_or_else(|| "second".to_string()); | ||
| let value = if has_units(&value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty terrible hack but it seems to be necessary to get the tests to pass (aka to have postgres compatible syntax) 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be valid to just search for a character that isn't a digit or decimal point?
|
|
||
| ## This is incredibly confusing but document it in tests: | ||
| # | ||
| # years is parsed as a column name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFusion is consistent with postgres so in that sense it is not a bug
This behavior is due to how the statement is parsed (sqlparser-rs treats years as the alias)
It is incredibly confusing (I was confused for a while)
Now that we have proper Utf8 -> Interval parsing (thanks to @doki23 ) in arrow-rs I think the preferred way to create intervals should be '5 months'::interval which does the right thing and doesn't have the wacky syntax
| query ? | ||
| select interval '5' year | ||
| ---- | ||
| 0 years 60 mons 0 days 0 hours 0 mins 0.000000000 secs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little strange behavior.
I think interval '24' month -> 0 years 24 mons is reasonable.
But it's strange interval '2' year -> 0 years 24 mons.
Maybe in the future, if we can add year in interval, it will can be resolved easily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree it is a little strange. I can't figure out how important it is to to mirror back exactly what the user said.
For some things like 5 years we could pick Interval(YearMonth) and encode exactly what the user specified
For things like 5 years 3 days to express that properly I think we need to use Interval(MonthDayNano) which is going to be formatted like 60 months ... 3 days
Or maybe we should just make the interval printing prettier 🤔 like convert 60 months to 5 years...
jackwener
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A great job! Thanks @alamb
Which issue does this PR close?
Closes #5801
Rationale for this change
Utf8andLargeUtf8-->Intervalarrow-rs#3762 so we should use that (in preparation of supporting intervals more fully in DataFusion)What changes are included in this PR?
parse_intervalfunction and use arrow kernel everyhereAre these changes tested?
yes, both existing and new tests
Are there any user-facing changes?
More consistency with how
intervalis handled