Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Mar 30, 2023

Which issue does this PR close?

Closes #5801

Rationale for this change

  1. To consistently support intervals we need to uniformly handle that type
  2. @doki23 ported the interval parsing code uptream in Support for casting Utf8 and LargeUtf8 --> Interval arrow-rs#3762 so we should use that (in preparation of supporting intervals more fully in DataFusion)

What changes are included in this PR?

  1. Remove datafusion special parse_interval function and use arrow kernel everyhere

Are these changes tested?

yes, both existing and new tests

Are there any user-facing changes?

More consistency with how interval is handled

@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules physical-expr Changes to the physical-expr crates sql SQL Planner sqllogictest SQL Logic Tests (.slt) and removed physical-expr Changes to the physical-expr crates labels Mar 30, 2023
[dependencies]
arrow-schema = { workspace = true }
arrow = { workspace = true }
datafusion-common = { path = "../common", version = "21.0.0" }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datafusion-common already depends on arrow so this is no new (net) dependency

format!("{value} {leading_field}")
} else {
// default to seconds for the units
format!("{value} seconds")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the same default as previously


/// Cast `expr` to the specified type, if possible
fn cast_expr(expr: &Expr, to_type: &DataType, schema: &DFSchema) -> Result<Expr> {
// Special case until Interval coercion is handled in arrow-rs
Copy link
Contributor Author

@alamb alamb Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 for avoiding the special cases now that arrow-rs has string --> Interval suppot

@alamb alamb force-pushed the alamb/interval_syntax branch from 83c6697 to 4eba5c2 Compare March 31, 2023 18:19
!matches!(self, &Self::UNCOMPRESSED)
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this PR is to remove this code (which has been upstreamed to arrow-rs)

test_expression!(
"interval '13 month'",
"1 years 1 mons 0 days 0 hours 0 mins 0.00 secs"
"0 years 13 mons 0 days 0 hours 0 mins 0.000000000 secs"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this matters -- but interval parsing now is month/day/nano always and thus is not quite as nice a display as it was previously (no years, instead it is always months). However perhaps if anyone cares we can improve the display in arrow-rs for this kind of interval

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@waitingkuo and @ovr -- Do you have any thoughts about this proposed change (that the reported intervals are not in the same units as the original expression)?

Copy link
Contributor

@tustvold tustvold Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this was a conscious change when the parsing logic was ported, to faithfully preserve what the user specified. This was mainly in the context of rounding days to months and hours to days, which were simply incorrect, but months to years got the same treatment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this was a conscious change when the parsing logic was ported, to faithfully preserve what the user specified. This was mainly in the context of rounding days to months and hours to days, which were simply incorrect, but months to years got the same treatment

Make sense to me👍

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exploring how postgres handles this, if you do EXTRACT(EPOCH FROM interval), it treats months as 30 days, until you get to 12 months, and then it is 365.25 days. Not sure how much it matters to match postgres, but that's what I've seen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯

"| 2020-09-08T11:42:29.190855 | 2020-09-08T11:42:30.190855 |",
"+----------------------------+--------------------------------------+",
"+----------------------------+-------------------------------------------------+",
"| ts | table_b.ts + IntervalMonthDayNano(\"1000000000\") |",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty terrible formatting for the column name 🤮 but I don't think it is fixable easily -- see #2027 for more discussion

.as_ref()
.map(|dt| dt.to_string())
.unwrap_or_else(|| "second".to_string());
let value = if has_units(&value) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty terrible hack but it seems to be necessary to get the tests to pass (aka to have postgres compatible syntax) 🤷

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be valid to just search for a character that isn't a digit or decimal point?

@alamb alamb marked this pull request as ready for review March 31, 2023 19:06

## This is incredibly confusing but document it in tests:
#
# years is parsed as a column name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFusion is consistent with postgres so in that sense it is not a bug

This behavior is due to how the statement is parsed (sqlparser-rs treats years as the alias)

It is incredibly confusing (I was confused for a while)

Now that we have proper Utf8 -> Interval parsing (thanks to @doki23 ) in arrow-rs I think the preferred way to create intervals should be '5 months'::interval which does the right thing and doesn't have the wacky syntax

query ?
select interval '5' year
----
0 years 60 mons 0 days 0 hours 0 mins 0.000000000 secs
Copy link
Member

@jackwener jackwener Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little strange behavior.

I think interval '24' month -> 0 years 24 mons is reasonable.
But it's strange interval '2' year -> 0 years 24 mons.

Maybe in the future, if we can add year in interval, it will can be resolved easily

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree it is a little strange. I can't figure out how important it is to to mirror back exactly what the user said.

For some things like 5 years we could pick Interval(YearMonth) and encode exactly what the user specified

For things like 5 years 3 days to express that properly I think we need to use Interval(MonthDayNano) which is going to be formatted like 60 months ... 3 days

Or maybe we should just make the interval printing prettier 🤔 like convert 60 months to 5 years...

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A great job! Thanks @alamb

@alamb alamb merged commit 15c65ee into apache:main Apr 4, 2023
@alamb alamb deleted the alamb/interval_syntax branch April 4, 2023 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

INTERVAL ... SQL syntax produces inconsistent types

4 participants