Format `Date32` to string given timestamp specifiers #15361

friendlymatthew · 2025-03-22T21:22:53Z

Closes Feature: support to_char(date, timstamp format) #14536

Rationale for this change

Datafusion currently errs when attempting to format a date using time-related specifiers.

> select to_char('2023-09-04'::date, '%Y-%m-%dT%H:%M:%S%.3f');
Execution error: Cast error: Format error

However, Postgres supports this feature as it implicitly treats the date as a timestamp.

Rather than eagerly casting every Date32 to a Date64 when calling to_char, this commit attempts to first format a Date32 with the supplied format string. If the formatting fails, we try to reformat as a Date64. This way, only format strings with time-related specifiers endure the intermediary cast.

All changes are tested, specifically for two different call sites: _to_char_scalar and _to_char_array.

alamb

Thanks @friendlymatthew -- something seems strange to me about this PR in that it is catching errors and reformatting. It seems like the allowable format options should be known based on the input data type 🤔

datafusion/functions/src/datetime/to_char.rs

alamb · 2025-03-30T12:09:44Z

datafusion/functions/src/datetime/to_char.rs


+    // eagerly cast Date32 values to Date64 to support date formatting with time-related specifiers
+    // without error.
+    if data_type == &Date32 {


as @Omega359 says, this will now penalize performance for all existing Date32 columns

Is there any way we can check if the format string contains any time related specifiers before doing this conversion?

It is possible to check for time formats in the format string however I'm not sure that cost would be any less than just doing the conversion up front. I think we need a benchmark to have quantified data vs assumptions.

Some data from a quick little benchmark I wrote

test_date32_to_date64_cast_array_1000 time: [311.06 ns 314.48 ns 318.16 ns] Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) high mild 3 (3.00%) high severe test_parse_performance_with_time_specifiers_array_1000 time: [343.79 ns 351.64 ns 359.98 ns] Found 14 outliers among 100 measurements (14.00%) 10 (10.00%) low mild 3 (3.00%) high mild 1 (1.00%) high severe test_parse_performance_without_time_specifiers_array_1000 time: [196.59 µs 201.06 µs 206.45 µs] Found 10 outliers among 100 measurements (10.00%) 2 (2.00%) high mild 8 (8.00%) high severe

test_date32_to_date64_cast_array_1000: just casts from date32 to date64

test_parse_performance_with_time_specifiers_array_1000: parses the formats looking for time specifiers and when found does the cast from date32 to date64

test_parse_performance_without_time_specifiers_array_1000: parses the formats looking for time specifiers (doesn't find anything), no cast

Note that results on my machine will vary +-10% between runs. The check for time specifiers in the format is simple and could be improved but I think is enough to show general performance. Note that the list of time specifiers is not complete

const TIME_SPECIFIERS: &[&str] = &[ "%H", "%M", "%S", "%c", "%f", "%k", "%I", "%l", "%p", "%R", "%T", "%x", "%r", "%Z", ]; fn has_time_specifiers(str: &str) -> bool { for specifier in TIME_SPECIFIERS { if str.contains(specifier) { return true; } } false }

A couple of takeaways from this. casting is not as cheap as I thought however parsing is seems to be more expensive than that but perhaps with some really good optimizations it could be reduced.

I don't see a good way to make this feature have no cost with Date32 && no time specifiers in the format.

Omega359 · 2025-04-02T21:33:21Z

I went ahead and did a quick poc using format parsing, you can see it @ https://github.com/Omega359/arrow-datafusion/tree/to_char_date32_with_time_formats (main...Omega359:arrow-datafusion:to_char_date32_with_time_formats)

I haven't done a comparison with the benchmark results from main yet.

friendlymatthew · 2025-04-02T21:47:05Z

I went ahead and did a quick poc using format parsing, you can see it @ https://github.com/Omega359/arrow-datafusion/tree/to_char_date32_with_time_formats (main...Omega359:arrow-datafusion:to_char_date32_with_time_formats)

I haven't done a comparison with the benchmark results from main yet.

I was planning on getting to this during the weekend, but wow, this is super awesome @Omega359! I love how has_time_specifier will early return true when encountering the first time-related specifier.

Let me know how you want to proceed, I'd be happy to help out with benchmarking.

Omega359 · 2025-04-03T13:50:44Z

@friendlymatthew - Running the benchmarks between main and the sidebranch would be very helpful - it should give us an idea as to the overhead of the format parsing for date32 data.

friendlymatthew · 2025-04-03T16:42:42Z

@friendlymatthew - Running the benchmarks between main and the sidebranch would be very helpful - it should give us an idea as to the overhead of the format parsing for date32 data.

Sounds great. I'll have some free time tomorrow to work on this.

friendlymatthew · 2025-04-04T16:43:19Z

I went ahead and did a quick poc using format parsing, you can see it @ https://github.com/Omega359/arrow-datafusion/tree/to_char_date32_with_time_formats (main...Omega359:arrow-datafusion:to_char_date32_with_time_formats)

I haven't done a comparison with the benchmark results from main yet.

@Omega359 would you be open to committing to this branch? This way, I could base the benchmarking work on top of your work.

If we decide to on a different approach than the experimented one, we could just roll back to a prior commit.

Omega359 · 2025-04-04T17:07:41Z

Your branch you mean? I can try and push a PR to your branch once I figure out how :)

friendlymatthew · 2025-04-04T17:16:03Z

Your branch you mean? I can try and push a PR to your branch once I figure out how :)

I just sent an invite to collaborate on my fork. From there you should be able to directly push a commit to this branch!

Omega359 · 2025-04-05T19:11:40Z

I messed up and pushed directly to your branch vs a PR, sorry about that @friendlymatthew

friendlymatthew · 2025-04-05T20:12:35Z

I messed up and pushed directly to your branch vs a PR, sorry about that @friendlymatthew

Not at all, thank you and happy to collaborate on this together!

friendlymatthew · 2025-04-06T16:42:27Z

Some benchmark results when comparing main with this branch:

(TLDR) This branch will make the existing to_char paths slower, at the cost of adding the ability to format time-specifiers.

I compared two benchmark functions: to_char_array_date_only_patterns_1000 and to_char_scalar_date_only_pattern_1000.

These functions are concerned with formatting dates with date-only specifiers.

From main:

to_char_array_date_only_patterns_1000
                        time:   [136.39 µs 136.68 µs 137.03 µs]

to_char_scalar_date_only_pattern_1000
                        time:   [92.084 µs 95.999 µs 100.12 µs]

And when compared to this branch:

to_char_array_date_only_patterns_1000
                        time:   [173.46 µs 174.21 µs 175.10 µs]
                        change: [+25.159% +25.881% +26.651%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild


to_char_scalar_date_only_pattern_1000
                        time:   [105.04 µs 109.46 µs 113.80 µs]
                        change: [+6.6648% +11.410% +16.429%] (p = 0.00 < 0.05)
                        Performance has regressed.

To reproduce:

You can run the two benchmark functions by:

cargo bench --package datafusion-functions "to_char_array_date_only_patterns_1000|to_char_scalar_date_only_pattern_1000"

Since the benchmark function names clash between main and this branch, you can consider this branch as main. (This branch only contains the refactor benchmark logic).

cc/ @Omega359

alamb · 2025-04-07T11:28:13Z

(TLDR) This branch will make the existing to_char paths slower, at the cost of adding the ability to format time-specifiers.

Slowing down existing functionality for current users for some new functionality doesn't sound like a great tradeoff to me -- if users want this behavior they can always override the to_char implementation

If we can retain the same performance for existing to_char but add the new feature I think that is much more reasonable

Omega359 · 2025-04-07T13:23:15Z

It's only slower when date32 is the input type. It's going to be hard to have this functionality for formats with timestamp specifiers but not have some sort of impact on those with just date specifiers. The only way I can think of to handle that is via catching the error when the parsing fails and retrying with date64.

To put some context on this:

The 25% is for the case where the use has an array of 1000 dates and 1000 formats (the table case)
The 7% is for the far more common case of 1000 dates and a single format. That is the overhead of chrono parsing a format string and doing matches on that.
This does not affect any other input types.

friendlymatthew · 2025-04-07T13:53:10Z

It may be beneficial to leave the existing code paths undisturbed and decide whether to retry on err.

It avoids the overhead of custom chrono::Strftime::parse validation. Plus, the retry logic for to_char_scalar is simple to implement. Something like:

to_char_scalar

  let formatter = ArrayFormatter::try_new(array.as_ref(), &format_options)?;
  let formatted: Result<Vec<Option<String>>, ArrowError> = (0..array.len())
      .map(|i| {
          if array.is_null(i) {
              Ok(None)
          } else {
              formatter.value(i).try_to_string().map(Some)
          }
      })
      .collect(); /*
                           will stop iterating upon the first err and 
                           since we're dealing with one format string, 
                           it'll halt after the first format attempt
                    */

  if let Ok(formatted) = formatted {
      if is_scalar_expression {
          Ok(ColumnarValue::Scalar(ScalarValue::Utf8(
              formatted.first().unwrap().clone(),
          )))
      } else {
          Ok(ColumnarValue::Array(
              Arc::new(StringArray::from(formatted)) as ArrayRef
          ))
      }
  } else {
      // if the format attempt is with a Date32, it's possible the attempt failed because
      // the format string contained time-specifiers, so we'll retry as Date64s
      if data_type == &Date32 {
          return to_char_scalar(expression.clone().cast_to(&Date64, None)?, format);
      }

      exec_err!("{}", formatted.unwrap_err())
  }

to_char_array is a bit more complex (see #15361 (comment)), but I'll think it through.

FWIW, it's worth implementing the retry logic and benchmarking the performance. I'm happy to do the work. The decision between eager casting and selective retry likely comes down to whether we're willing to slightly slow down existing behavior in exchange for better performance in the new case. But I'm just spitballing.

…ny time specifiers present.

friendlymatthew · 2025-06-28T02:44:31Z

I'll go ahead and close this

Omega359 · 2025-07-01T17:36:00Z

@friendlymatthew - I think this actually should be reopened and just be updated to the previous version of the code then merged in - I think it's a worthy addition to DF. I think this issue just unfortunately dropped off the radar at the end :(

friendlymatthew · 2025-07-01T18:14:40Z

@friendlymatthew - I think this actually should be reopened and just be updated to the previous version of the code then merged in - I think it's a worthy addition to DF. I think this issue just unfortunately dropped off the radar at the end :(

Sure

alamb · 2025-07-01T20:58:57Z

Thanks @Omega359 and @friendlymatthew -- when you are happy with it, please ping me and I will give it a final review

Omega359 · 2025-08-05T18:21:36Z

@friendlymatthew - Are you be able to update this with the previous version of the code or would you mind if I made the changes either against your branch or a new branch and PR (giving you credit for it of course since you did the work!) ?

friendlymatthew · 2025-08-06T13:31:11Z

@friendlymatthew - Are you be able to update this with the previous version of the code or would you mind if I made the changes either against your branch or a new branch and PR (giving you credit for it of course since you did the work!) ?

Hi, please feel free to take over. You should have write access to my fork; you can also make a new PR if you'd like.

I'm a bit swamped with work-- I need to close out some projects and go back to work related to arrow

This reverts commit b379fe1.

…e32-with-ts-format # Conflicts: # datafusion/functions/benches/to_char.rs

Omega359 · 2025-08-11T00:00:00Z

I've taken the liberty to revert b379fe1 and to update the code to reflect changes in main. This should be ready to review now @alamb

alamb · 2025-08-12T15:35:15Z

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing friendlymatthew/date32-with-ts-format (9ab910b) to 7d52145 diff
BENCH_NAME=to_char
BENCH_COMMAND=cargo bench --bench to_char
BENCH_FILTER=
BENCH_BRANCH_NAME=friendlymatthew_date32-with-ts-format
Results will be posted here when complete

alamb · 2025-08-12T15:42:00Z

🤖: Benchmark completed

Details

group                                    friendlymatthew_date32-with-ts-format    main
-----                                    -------------------------------------    ----
to_char_array_array_1000                                                          1.00    240.9±0.78µs        ? ?/sec
to_char_array_date_only_patterns_1000    1.00    242.8±1.12µs        ? ?/sec    
to_char_array_datetime_patterns_1000     1.00    853.6±5.00µs        ? ?/sec    
to_char_array_mixed_patterns_1000        1.00   579.5±11.75µs        ? ?/sec    
to_char_array_scalar_1000                                                         1.00    218.7±0.48µs        ? ?/sec
to_char_scalar_1000                      1.00   716.5±21.37ns        ? ?/sec    
to_char_scalar_date_only_pattern_1000    1.00   210.4±20.22µs        ? ?/sec    
to_char_scalar_datetime_pattern_1000     1.00   384.3±66.98µs        ? ?/sec    
to_char_scalar_scalar_1000                                                        1.00    877.8±1.81ns        ? ?/sec

alamb

Thank you @friendlymatthew and @Omega359

While I suspect there is more performance to be had with this implementation, this looks like a step forward to me.

Thank you

alamb · 2025-08-12T15:44:06Z

datafusion/functions/src/datetime/to_char.rs

+        // if the data type was a Date32, formatting could have failed because the format string
+        // contained datetime specifiers, so we'll retry by casting the date array as a timestamp array
+        if data_type == &Date32 {
+            return to_char_scalar(expression.clone().cast_to(&Date64, None)?, format);


this clone seems unecessary. I'll make a PR to try and avoid some clones as a follow up

Omega359 · 2025-08-12T16:11:15Z

🤖 ./gh_compare_branch_bench.sh Benchmark Script

BTW, that script looks like it could be interesting but I don't see it in your repo. Any chance you could push that one up?

alamb · 2025-08-12T16:27:16Z

🤖 ./gh_compare_branch_bench.sh Benchmark Script

BTW, that script looks like it could be interesting but I don't see it in your repo. Any chance you could push that one up?

Soryr -- I think the link is wrong. The script is here: https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch_bench.sh

I will update the checkout

alamb

I also filed a ticket to track improving the performance even more

#17152

alamb · 2025-08-19T19:04:59Z

Thanks again for helping get this over the line @Omega359

github-actions bot added the functions Changes to functions implementation label Mar 22, 2025

friendlymatthew force-pushed the friendlymatthew/date32-with-ts-format branch from 52e1d29 to 1e57ed1 Compare March 22, 2025 21:24

alamb reviewed Mar 23, 2025

View reviewed changes

datafusion/functions/src/datetime/to_char.rs Show resolved Hide resolved

friendlymatthew marked this pull request as draft March 26, 2025 16:19

friendlymatthew marked this pull request as ready for review March 27, 2025 03:00

friendlymatthew force-pushed the friendlymatthew/date32-with-ts-format branch from 88d304e to ec2373a Compare March 27, 2025 03:24

xudong963 mentioned this pull request Mar 27, 2025

Release DataFusion 47.0.0 (April 2025) #15072

Closed

39 tasks

friendlymatthew requested a review from alamb March 29, 2025 13:59

alamb reviewed Mar 30, 2025

View reviewed changes

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 5, 2025

friendlymatthew force-pushed the friendlymatthew/date32-with-ts-format branch 2 times, most recently from a855bb0 to c69180b Compare April 10, 2025 15:10

friendlymatthew and others added 5 commits April 10, 2025 11:11

Format Date32 to string given timestamp specifiers

73f7128

Eagerly cast Date32 to Date64 before formatting

3324c94

Remove superfluous clone

6930caa

POC for parsing format with StrftimeItems to determine if there are a…

deaabe4

…ny time specifiers present.

Update benchmark code

01b1bc0

github-actions bot added the Stale PR has not had any activity for some time label Jun 28, 2025

friendlymatthew closed this Jun 28, 2025

friendlymatthew reopened this Jul 1, 2025

github-actions bot removed the Stale PR has not had any activity for some time label Jul 2, 2025

alamb mentioned this pull request Jul 16, 2025

Release DataFusion 50.0.0 (Aug/Sep 2025) #16799

Closed

45 tasks

Omega359 added 5 commits August 10, 2025 21:50

Revert "Keep track of consecutive Date32 failures"

ed961cb

This reverts commit b379fe1.

Updated example filename.

6c8d1eb

Merge remote-tracking branch 'upstream/main' into friendlymatthew/dat…

fae77f7

…e32-with-ts-format # Conflicts: # datafusion/functions/benches/to_char.rs

Updates to reflect changes in main.

22cb09a

Documentation update.

50da845

github-actions bot added the documentation Improvements or additions to documentation label Aug 10, 2025

Merge branch 'main' into friendlymatthew/date32-with-ts-format

9ab910b

alamb approved these changes Aug 12, 2025

View reviewed changes

This was referenced Aug 12, 2025

Cleanup to_char to avoid some clone #17151

Draft

Optimize to_char for array conversions #17152

Open

alamb reviewed Aug 12, 2025

View reviewed changes

alamb merged commit afc0537 into apache:main Aug 19, 2025
28 checks passed

alamb mentioned this pull request Sep 14, 2025

Release DataFusion 51.0.0 (Nov 2025) #17558

Open

25 tasks

Format Date32 to string given timestamp specifiers #15361

Format Date32 to string given timestamp specifiers #15361

Uh oh!

Conversation

friendlymatthew commented Mar 22, 2025 • edited by alamb Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alamb Mar 30, 2025

Choose a reason for hiding this comment

Uh oh!

Omega359 Mar 30, 2025

Choose a reason for hiding this comment

Uh oh!

Omega359 Mar 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Omega359 commented Apr 2, 2025

Uh oh!

friendlymatthew commented Apr 2, 2025

Uh oh!

Omega359 commented Apr 3, 2025

Uh oh!

friendlymatthew commented Apr 3, 2025

Uh oh!

friendlymatthew commented Apr 4, 2025

Uh oh!

Omega359 commented Apr 4, 2025

Uh oh!

friendlymatthew commented Apr 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Omega359 commented Apr 5, 2025

Uh oh!

friendlymatthew commented Apr 5, 2025

Uh oh!

friendlymatthew commented Apr 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

To reproduce:

Uh oh!

alamb commented Apr 7, 2025

Uh oh!

Omega359 commented Apr 7, 2025

Uh oh!

friendlymatthew commented Apr 7, 2025

Uh oh!

friendlymatthew commented Jun 28, 2025

Uh oh!

Omega359 commented Jul 1, 2025

Uh oh!

friendlymatthew commented Jul 1, 2025

Uh oh!

alamb commented Jul 1, 2025

Uh oh!

Omega359 commented Aug 5, 2025

Uh oh!

friendlymatthew commented Aug 6, 2025

Uh oh!

Omega359 commented Aug 11, 2025

Uh oh!

alamb commented Aug 12, 2025

Uh oh!

alamb commented Aug 12, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

Omega359 commented Aug 12, 2025

Uh oh!

alamb commented Aug 12, 2025

Uh oh!

alamb left a comment

Format `Date32` to string given timestamp specifiers #15361

Format `Date32` to string given timestamp specifiers #15361

friendlymatthew commented Mar 22, 2025 •

edited by alamb

Loading

Omega359 Mar 30, 2025 •

edited

Loading

friendlymatthew commented Apr 4, 2025 •

edited

Loading

friendlymatthew commented Apr 6, 2025 •

edited

Loading