
Conversation

@yaooqinn
Member

What changes were proposed in this pull request?

```sql
spark-sql> SELECT extract(dayofweek from '2009-07-26');
1
spark-sql> SELECT extract(dow from '2009-07-26');
0
spark-sql> SELECT extract(isodow from '2009-07-26');
7
spark-sql> SELECT dayofweek('2009-07-26');
1
spark-sql> SELECT weekday('2009-07-26');
6
```

Currently, there are 4 different day-of-week ranges:

  1. the function `dayofweek` (2.3.0) and extracting `dayofweek` (2.4.0) return Sunday(1) to Saturday(7)
  2. extracting `dow` (3.0.0) returns Sunday(0) to Saturday(6)
  3. extracting `isodow` (3.0.0) returns Monday(1) to Sunday(7)
  4. the function `weekday` (2.4.0) returns Monday(0) to Sunday(6)
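Not part of the PR itself, but the four ranges can be cross-checked with Python's standard `datetime` (an illustration of the semantics only; Spark's implementations are Catalyst expressions):

```python
from datetime import date

d = date(2009, 7, 26)  # a Sunday

isodow    = d.isoweekday()   # Monday(1)..Sunday(7)   -> extract(isodow)
dayofweek = isodow % 7 + 1   # Sunday(1)..Saturday(7) -> dayofweek(), extract(dayofweek)
dow       = isodow % 7       # Sunday(0)..Saturday(6) -> extract(dow) before this PR
weekday   = isodow - 1       # Monday(0)..Sunday(6)   -> weekday()

print(dayofweek, dow, isodow, weekday)  # 1 0 7 6
```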

Actually, extracting dayofweek and dow are both derived from PostgreSQL but have different meanings.
https://issues.apache.org/jira/browse/SPARK-23903
https://issues.apache.org/jira/browse/SPARK-28623

In this PR, we make extracting `dow` behave the same as extracting `dayofweek` and the `dayofweek` function, for historical reasons and to avoid breaking anything.

Also, this PR adds more documentation to the extract function to make the supported `field` values easier to understand.

Why are the changes needed?

To ensure consistency among the day-of-week functions.

Does this PR introduce any user-facing change?

Yes: the docs are updated, and extracting `dow` now behaves the same as `dayofweek`.

How was this patch tested?

  1. modified unit tests
  2. local SQL doc verification

before

![image](https://user-images.githubusercontent.com/8326978/79601949-3535b100-811c-11ea-957b-a33d68641181.png)

after

![image](https://user-images.githubusercontent.com/8326978/79601847-12a39800-811c-11ea-8ff6-aa329255d099.png)

@yaooqinn
Member Author

cc @cloud-fan @dongjoon-hyun @HyukjinKwon thanks.

@dongjoon-hyun
Member

dongjoon-hyun commented Apr 17, 2020

Hi, @yaooqinn. I also know the discussion history on the old PRs, but it would be great if you could add a link to the ANSI SQL reference in the PR description (if possible).

cc @gatorsmile

```diff
 case "DAY" | "D" | "DAYS" => DayOfMonth(source)
-case "DAYOFWEEK" => DayOfWeek(source)
-case "DOW" => Subtract(DayOfWeek(source), Literal(1))
+case "DOW" | "DAYOFWEEK" => DayOfWeek(source)
```
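The effect of this change can be sketched in Python (an illustration of the semantics, not Spark's code):

```python
from datetime import date

def spark_day_of_week(d):
    """Sketch of Spark's DayOfWeek semantics: Sunday(1)..Saturday(7)."""
    return d.isoweekday() % 7 + 1

d = date(2009, 7, 26)              # a Sunday
before = spark_day_of_week(d) - 1  # old DOW: Subtract(DayOfWeek(source), Literal(1))
after  = spark_day_of_week(d)      # new DOW: identical to DAYOFWEEK
print(before, after)  # 0 1
```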
Member

It seems to be different from @cloud-fan 's request.
@cloud-fan . Could you confirm this?


@yaooqinn
Member Author

Hi, @dongjoon-hyun. The ANSI SQL reference does not seem to define the meaning of dow; it is more likely defined by ISO. Some platforms, such as PostgreSQL/Presto/DB2/SQL Server, support it, while others, such as Oracle, don't. The systems that do support dow also differ from one another in whether it is 0- or 1-based and whether the week starts on Monday or Sunday when there is no 'iso' suffix or prefix. But at least they do not vary within their own systems.

@dongjoon-hyun
Member

Thanks. Then, did you see the following?

@yaooqinn
Member Author

Hi, @dongjoon-hyun, thanks for adding that reference. Actually, before this work, I reached out to @cloud-fan and discussed this issue offline. I got his approval for the changes made in this PR, and I hope other members of the Spark community will check whether this is a proper change.

@SparkQA

SparkQA commented Apr 18, 2020

Test build #121440 has finished for PR 28248 at commit 746eedf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Ya. Got it. Let's see the review comment~

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Apr 18, 2020

Test build #121448 has finished for PR 28248 at commit 746eedf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @yaooqinn .
The first target of an expression description is DESC FUNCTION EXTENDED in the SQL environment. The last commit produces the following output. Please check the result and revert the HTML-tag changes.

```
spark-sql> DESC FUNCTION EXTENDED date_part;
Function: date_part
Class: org.apache.spark.sql.catalyst.expressions.DatePart
Usage: date_part(field, source) - Extracts a part of the date/timestamp or interval source.
Extended Usage:
    Arguments:
      * field - selects which part of the source should be extracted.
      <ul>
        <b> Supported string values of `field` for dates and timestamps are: </b>
          <li> "MILLENNIUM", ("MILLENNIA", "MIL", "MILS") - the conventional numbering of millennia </li>
          <li> "CENTURY", ("CENTURIES", "C", "CENT") - the conventional numbering of centuries </li>
          <li> "DECADE", ("DECADES", "DEC", "DECS") - the year field divided by 10 </li>
          <li> "YEAR", ("Y", "YEARS", "YR", "YRS") - the year field </li>
```

@SparkQA

SparkQA commented Apr 19, 2020

Test build #121478 has finished for PR 28248 at commit fdb5467.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AuthRpcHandler extends AbstractAuthRpcHandler
  • public class SaslRpcHandler extends AbstractAuthRpcHandler
  • public abstract class AbstractAuthRpcHandler extends RpcHandler
  • case class Extract(field: Expression, source: Expression, child: Expression)

@yaooqinn
Member Author

@dongjoon-hyun thanks for checking the describe function command.

It seems that using `*`/`-` markers carefully lets us have it both ways: in the SQL command output and in the SQL doc.

Please check the result below to see whether you are satisfied with it.

image

+-- !query
+DESC FUNCTION EXTENDED date_part
+-- !query schema
+struct<function_desc:string>
+-- !query output
+Class: org.apache.spark.sql.catalyst.expressions.DatePart
+Extended Usage:
+    Arguments:
+      * field - selects which part of the source should be extracted.
+          - Supported string values of `field` for dates and timestamps are:
+              - "MILLENNIUM", ("MILLENNIA", "MIL", "MILS") - the conventional numbering of millennia
+              - "CENTURY", ("CENTURIES", "C", "CENT") - the conventional numbering of centuries
+              - "DECADE", ("DECADES", "DEC", "DECS") - the year field divided by 10
+              - "YEAR", ("Y", "YEARS", "YR", "YRS") - the year field
+              - "ISOYEAR" - the ISO 8601 week-numbering year that the datetime falls in
+              - "QUARTER", ("QTR") - the quarter (1 - 4) of the year that the datetime falls in
+              - "MONTH", ("MON", "MONS", "MONTHS") - the month field
+              - "WEEK", ("W", "WEEKS") - the number of the ISO 8601 week-of-week-based-year. A week is considered to start on a Monday and week 1 is the first week with >3 days. In the ISO week-numbering system, it is possible for early-January dates to be part of the 52nd or 53rd week of the previous year, and for late-December dates to be part of the first week of the next year. For example, 2005-01-02 is part of the 53rd week of year 2004, while 2012-12-31 is part of the first week of 2013
+              - "DAY", ("D", "DAYS") - the day of the month field (1 - 31)
+              - "DAYOFWEEK",("DOW") - the day of the week for datetime as Sunday(1) to Saturday(7)
+              - "ISODOW" - ISO 8601 based day of the week for datetime as Monday(1) to Sunday(7)
+              - "DOY" - the day of the year (1 - 365/366)
+              - "HOUR", ("H", "HOURS", "HR", "HRS") - The hour field (0 - 23)
+              - "MINUTE", ("M", "MIN", "MINS", "MINUTES") - the minutes field (0 - 59)
+              - "SECOND", ("S", "SEC", "SECONDS", "SECS") - the seconds field, including fractional parts
+              - "MILLISECONDS", ("MSEC", "MSECS", "MILLISECON", "MSECONDS", "MS") - the seconds field, including fractional parts, multiplied by 1000. Note that this includes full seconds
+              - "MICROSECONDS", ("USEC", "USECS", "USECONDS", "MICROSECON", "US") - The seconds field, including fractional parts, multiplied by 1000000. Note that this includes full seconds
+              - "EPOCH" - the number of seconds with fractional part in microsecond precision since 1970-01-01 00:00:00 local time (can be negative)
+          - Supported string values of `field` for interval(which consists of `months`, `days`, `microseconds`) are:
+              - "YEAR", ("Y", "YEARS", "YR", "YRS") - the total `months` / 12
+              - "MONTH", ("MON", "MONS", "MONTHS") - the total `months` modulo 12
+              - "DAY", ("D", "DAYS") - the `days` part of interval
+              - "HOUR", ("H", "HOURS", "HR", "HRS") - how many hours the `microseconds` contains
+              - "MINUTE", ("M", "MIN", "MINS", "MINUTES") - how many minutes left after taking hours from `microseconds`
+              - "SECOND", ("S", "SEC", "SECONDS", "SECS") - how many second with fractions left after taking hours and minutes from `microseconds`
+      * source - a date/timestamp or interval column from where `field` should be extracted
+
+    Examples:
+      > SELECT date_part('YEAR', TIMESTAMP '2019-08-12 01:00:00.123456');
+       2019
+      > SELECT date_part('week', timestamp'2019-08-12 01:00:00.123456');
+       33
+      > SELECT date_part('doy', DATE'2019-08-12');
+       224
+      > SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001');
+       1.000001
+      > SELECT date_part('days', interval 1 year 10 months 5 days);
+       5
+      > SELECT date_part('seconds', interval 5 hours 30 seconds 1 milliseconds 1 microseconds);
+       30.001001
+
+    Note:
+      The date_part function is equivalent to the SQL-standard function `extract`
+
+    Since: 3.0.0
+
+Function: date_part
+Usage: date_part(field, source) - Extracts a part of the date/timestamp or interval source.
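The ISO 8601 week-numbering semantics quoted in the "WEEK" entry above (e.g. 2005-01-02 belonging to week 53 of 2004, and 2012-12-31 to week 1 of 2013) can be cross-checked with Python's `datetime`, which follows the same ISO 8601 rules:

```python
from datetime import date

# Early-January dates can belong to the last ISO week of the previous year,
# and late-December dates to the first ISO week of the next year.
y1, w1, _ = date(2005, 1, 2).isocalendar()    # last week of 2004
y2, w2, _ = date(2012, 12, 31).isocalendar()  # first week of 2013
print(y1, w1, y2, w2)  # 2004 53 2013 1
```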

```diff
 case "DAY" | "D" | "DAYS" => DayOfMonth(source)
-case "DAYOFWEEK" => DayOfWeek(source)
-case "DOW" => Subtract(DayOfWeek(source), Literal(1))
+case "DAYOFWEEK" | "DOW" => DayOfWeek(source)
```
Contributor

I said that the DOW behavior looks more reasonable, but unfortunately, we already have DAYOFWEEK in Spark 2.4 and we can't change that. It's more important to keep internal consistency.

- "YEAR", ("Y", "YEARS", "YR", "YRS") - the year field
- "ISOYEAR" - the ISO 8601 week-numbering year that the datetime falls in
- "QUARTER", ("QTR") - the quarter (1 - 4) of the year that the datetime falls in
- "MONTH", ("MON", "MONS", "MONTHS") - the month field
Contributor

the month field (1 - 12)

- "EPOCH" - the number of seconds with fractional part in microsecond precision since 1970-01-01 00:00:00 local time (can be negative)
- Supported string values of `field` for interval(which consists of `months`, `days`, `microseconds`) are:
- "YEAR", ("Y", "YEARS", "YR", "YRS") - the total `months` / 12
- "MONTH", ("MON", "MONS", "MONTHS") - the total `months` modulo 12
Contributor

the total `months` % 12

to be consistent with

the total `months` / 12
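The YEAR/MONTH interval arithmetic discussed here can be sketched as follows (an illustration assuming a non-negative total-months value, not Spark's actual implementation):

```python
def extract_interval_year_month(months):
    """YEAR and MONTH fields of an interval, given its total months component.

    YEAR is the total months divided by 12; MONTH is the total months modulo 12.
    """
    return months // 12, months % 12

# interval 1 year 10 months -> 22 total months
print(extract_interval_year_month(22))  # (1, 10)
```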

30.001001
""",
note = """
The _FUNC_ function is equivalent to the SQL-standard function `extract`
Contributor

EXTRACT(field FROM source)

30.001001
""",
note = """
The _FUNC_ function is equivalent to `date_part`.
Contributor

date_part(field, source)

Contributor

BTW, is EXTRACT more widely used? If yes, then we should put the documentation in Extract.

Member Author

sgtm

@SparkQA

SparkQA commented Apr 20, 2020

Test build #121494 has finished for PR 28248 at commit 0b90597.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 20, 2020

Test build #121507 has finished for PR 28248 at commit 2cedf2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please

@SparkQA

SparkQA commented Apr 20, 2020

Test build #121512 has finished for PR 28248 at commit 2cedf2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 21, 2020

Test build #121565 has finished for PR 28248 at commit 10b733c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CacheManager extends Logging with AdaptiveSparkPlanHelper
  • case class AdaptiveExecutionContext(session: SparkSession, qe: QueryExecution)

@yaooqinn
Member Author

retest this please

@SparkQA

SparkQA commented Apr 21, 2020

Test build #121573 has finished for PR 28248 at commit 10b733c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CacheManager extends Logging with AdaptiveSparkPlanHelper
  • case class AdaptiveExecutionContext(session: SparkSession, qe: QueryExecution)

@cloud-fan
Contributor

thanks, merging to master/3.0!

cloud-fan pushed a commit that referenced this pull request Apr 21, 2020
…ession and dayofweek function


Closes #28248 from yaooqinn/SPARK-31474.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 1985437)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan cloud-fan closed this in 1985437 Apr 21, 2020
@cloud-fan cloud-fan changed the title [SPARK-31474][SQL] Consistency between dayofweek/dow in extract exprsession and dayofweek function [SPARK-31474][SQL] Consistency between dayofweek/dow in extract expression and dayofweek function Apr 21, 2020
HyukjinKwon pushed a commit that referenced this pull request Apr 23, 2020
…name in the note field of expression info

### What changes were proposed in this pull request?

\_FUNC\_ has been used in note() of `ExpressionDescription` since #28248, and there may be more such cases later; we should replace it with the function name in the documentation.

### Why are the changes needed?

doc fix

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

pass Jenkins, and verify locally with Jekyll serve

Closes #28305 from yaooqinn/SPARK-31474-F.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Apr 23, 2020
…name in the note field of expression info


Closes #28305 from yaooqinn/SPARK-31474-F.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 3b57921)
Signed-off-by: HyukjinKwon <[email protected]>