Skip to content

Conversation

@MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Sep 18, 2022

Here's my proposal to address #12585

pdep4_2

@MarcoGorelli MarcoGorelli added the PDEP pandas enhancement proposal label Sep 18, 2022
@MarcoGorelli MarcoGorelli requested review from a team September 18, 2022 13:36
@ahmetanildindar
Copy link

Dear @MarcoGorelli,
I have read your commit and it seems ok to me.

Ahmet

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a general question: How would this impact functions like read_csv?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 lgtm.


## Motivation and Scope

Pandas date parsing is very flexibible, but arguably too much so - see
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo on flexible

Pandas date parsing is very flexibible, but arguably too much so - see
https://github.com/pandas-dev/pandas/issues/12585 and linked issues for how
much confusion this causes. Pandas can swap format midway, and though this
is document, it regularly breaks users' expectations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documented

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Sep 18, 2022

Just a general question: How would this impact functions like read_csv?

Thanks for taking a look - in read_csv, if parsing a date column fails, then the column is just kept as object

If a column or index cannot be represented as an array of datetimes, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type

So, if read_csv would previously have parsed a column whilst switching format midway, with this PDEP it wouldn't parse it and would return it unaltered as object data type

Example

current behaviour
Swaps format midway

In [2]: pd.read_csv(io.StringIO('12-01-2000 00:00\n13-01-2000 00:00'), header=None, parse_dates=[0])[0]
Out[2]:
0   2000-12-01
1   2000-01-13
Name: 0, dtype: datetime64[ns]

new behaviour
Can't parse according to first row, so returns column unaltered as object

In [2]: pd.read_csv(io.StringIO('12-01-2000 00:00\n13-01-2000 00:00'), header=None, parse_dates=[0])[0]
Out[2]:
0    12-01-2000 00:00
1    13-01-2000 00:00
Name: 0, dtype: object

@lordgrenville
Copy link
Contributor

Thank you for this! It looks like a really good solution.

I don't know the code well enough to know how feasible this is, but in case of format inference, would it be possible to inform the user of what format has been inferred? At least maybe in case of errors. The user might not realise that there is a ['12-01-2000 00:00:00', '13-01-2000 00:00:00'] lurking somewhere in the data.

@MarcoGorelli
Copy link
Member Author

Cheers! And yeah if there's an error and the user is using errors='coerce' (the default), then pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00']) would give

ValueError: time data '13-01-2000 00:00:00' does not match format '%m-%d-%Y %H:%M:%S' (match)

@jreback
Copy link
Contributor

jreback commented Sep 18, 2022

if
errors=coerce then this wouldn't raise

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Sep 18, 2022

True, it would just show DatetimeIndex(['2000-12-01', 'NaT'], dtype='datetime64[ns]', freq=None)

Maybe there could be a verbose argument which prints out the inferred format, but I'd suggest we keep that to a separate discussion. For now, I think that if people just call .head() on the output, they can tell if the format has been inferred correctly

@rhshadrach
Copy link
Member

rhshadrach commented Sep 18, 2022

Maybe there could be a verbose argument which prints out the inferred format, but I'd suggest we keep that to a separate discussion. For now, I think that if people just call .head() on the output, they can tell if the format has been inferred correctly

Would it make sense to add guess_datetime_format to the public API instead (possibly renamed)?

from pandas._libs.tslibs.parsing import guess_datetime_format

print(guess_datetime_format('12-01-2000 00:00:00'))

# %m-%d-%Y %H:%M:%S

@scott-r
Copy link

scott-r commented Sep 18, 2022

I don't think that what is proposed here (failing to parse date strings that don't match the inferred format) is the behavior you would want all the time. It might be a good option to have, controlled by a flag parameter.

I say that because the current behavior of infer_datetime_format is never harmful. It can speed up the parsing of a long list of dates, but it will never cause errors from date strings that could have been parsed correctly without this option. In real-world cases where the date format may not be 100% consistent, I think you want the parsing to succeed whenever it can. If it's really true that infer_datetime_format is purely a performance boost with no behavior change (and this is true as far as I know), maybe it should be enabled all the time (i.e. remove the option) or at least make it true by default.

The behavior proposed here is already available with the format option, but only if you already know what the format is. I can see where it would be valuable to be able to strictly enforce an unknown (inferred) format. So perhaps this behavior could be enabled with a flag such as infer_strict_format (default false). I can see the potential for confusion with the existing infer_datetime_format but that problem could be ameliorated, as suggested above, by removing the existing flag.

@lordgrenville
Copy link
Contributor

@scott-r Disagree with

...the current behavior of infer_datetime_format is never harmful

The harm is silently inferring two different formats, when in reality there is only one.

@scott-r
Copy link

scott-r commented Sep 19, 2022

@scott-r Disagree with

...the current behavior of infer_datetime_format is never harmful

The harm is silently inferring two different formats, when in reality there is only one.

@lordgrenville To be clear, I used the phrase "never harmful" because infer_datetime_format controls a performance optimization that will not produce any more, or fewer, parsing errors than there would be if it were not used. Therefore, it's not clear to me why it even needs to exist as an option; it should be safe to enable it at all times.

Separate from that, I agree with you and @MarcoGorelli that sometimes you might want to enforce the restriction that all the date strings must adhere to a single format. That can be done today with the format option (when the single format is known) or by a new option (which I called infer_strict_format) that would enable the behavior @MarcoGorelli is proposing when the single format is not known at the outset. My only quibble is that there are valid use cases for both the current and proposed behavior; I don't think it's safe to enable the new behavior unconditionally.

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Sep 19, 2022

@scott-r the current behaviour is unexpected to many and causes problems, see the many comments in #12585 and the many linked issues

The behaviour of infer_datetime_format is also unexpected to many, see here, here and here, and many more

And sure, mixed formats might be a valid use case, but they can still be parsed using apply (where at least it's clear that each format will be guessed individually), e.g.

pd.Series(['12-01-2000 00:00:00', '13 January 2000']).apply(pd.to_datetime)

so I'd be -1 on adding another boolean argument

+1 lgtm.

Thanks Jeff! I'll wait to see if others have objections, then I'll change the status to accepted

- if no ``format`` is specified, ``pandas`` will guess the format from the first non-NaN row
and parse the rest of the input according to that format. Errors will be handled
according to the ``errors`` argument - there will be no silent switching of format;
- ``infer_datetime_format`` will be deprecated;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you were parsing 11-12-10, would the default be November 12, 2010, or December 11, 2010 or December 10, 2011? I think we should be explicit in this proposal on how dayfirst and yearfirst interacts with this particular behavior, especially since the docs say that those 2 parameters are not strict.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont know what infer_datetime_format currently does but is it not very restrictive, and not always possible, to infer from just the first element?

If that were to be the case I support the comment of outlining how to operate with year first and day first. As a European my colleagues and I hate US mm/dd/yy and I think this might also need outlining some specifics.

The advantage of providing a consistent format input is that better inference could be made from multiple samples and this is very useful for structured data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both for taking a look!

@Dr-Irv this proposal wouldn't change how dayfirst and yearfirst operate. The format will try to be guessed in accordance with these parameters, just like it is on main - the difference is that with this proposal, the format guessed from the first non-NaN row will be used to parse the rest of the Series

@attack68 in the rare case that it's not possible to guess the format from the first element, then a UserWarning would be raised, check lines 49-55 of this PR

You're very right to bring up mm/dd/yy 👍 - indeed the vast majority of the world doesn't use that format. That's why the current behaviour is so dangerous. For example, suppose your data is in %d-%m-%Y %H:%M format:

On main, the first row's date would be parsed as mm-dd-yyyy, whilst the second one as dd-mm-yyyy. No error, no warning, this is very easy to miss (and I almost did once in a prod setting 😳 ):

In [1]: pd.to_datetime(['12-01-2000 00:00', '13-01-2000 00:00'])
Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)

With this PDEP, you could just check the format of your first row, and you'd know the rest of the Series was parsed in accordance to that. If it can't be, then with errors='raise' (the default), you'd get an error

ValueError: time data '13-01-2000 00:00' does not match format '%m-%d-%Y %H:%M' (match)

and you'd see that the guessed format wasn't right. You could get around that either by explicitly passing format, or with dayfirst=True:

In [2]: pd.to_datetime(['12-01-2000 00:00', '13-01-2000 00:00'], dayfirst=True)
Out[2]: DatetimeIndex(['2000-01-12', '2000-01-13'], dtype='datetime64[ns]', freq=None)

Totally agree on better documenting this, and that inference could be optimised by using multiple samples to guess - first, I just wanted to get agreement that we want to_datetime to parse consistently

@srotondo
Copy link
Contributor

srotondo commented Sep 19, 2022

The behaviour of infer_datetime_format is also unexpected to many, see here, here and here, and many more

@MarcoGorelli What if we were to change the current function of infer_datetime_format to what you are proposing, which would cause parsing errors when the format is different so that it matches expectations, then added a different flag (perhaps called try_fast_parsing or something) with the current behavior of infer_datetime_format? That way we could fix any confusion with the current functionality, but we also wouldn't lose the functionality and options to_datetime currently has.

@MarcoGorelli
Copy link
Member Author

Thanks for taking a look @srotondo

I don't think that would fix the confusion - in #12585 and linked issues, people are using to_datetime without infer_datetime_format yet are still expecting consistent parsing

And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)

Furthermore, I'd be -1 on adding yet another boolean argument thus increasing complexity even more

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Sep 19, 2022

I don't think that would fix the confusion - in #12585 and linked issues, people are using to_datetime without infer_datetime_format yet are still expecting consistent parsing

And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)

What if you keep infer_datetime_format, but change its type to str and the default behavior. Values could be (open to changing these):

  • consistent : the new default, based on this proposal. Use first item to determine format, retain that format throughout
  • inconsistent: the format may change based on how each element is parsed (current behavior with value of True ?)
  • None: Do not infer, and only use the value of format (current default with False ?)

So people with code that doesn't specify infer_datetime_format will get the new behavior. If someone wants the old behavior, they can ask for it. If they do specify it as True or False, we raise.

@srotondo
Copy link
Contributor

And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)

@MarcoGorelli I suppose if the current functionality is still available through .apply(), that's probably fine, but do take care that any change you make to to_datetime doesn't affect the outcome of .apply(to_datetime) and also make it very clear in the documentation that that is how you get the old behavior. I'm just not very comfortable with completely removing the old functionality from to_datetime since even though you stated that many people expect this proposed behavior, it's possible some people use and rely on the current behavior. But as long as you can clearly and easily find how to get that old behavior, it's probably alright. Just please be sure that the current behavior isn't completely removed and is still accessible somehow.

@MarcoGorelli
Copy link
Member Author

@Dr-Irv sure, that'd be an option, but if there's gonna be a breaking change, I'd suggest we take the chance to simplify - having infer_datetime_format take three different options feels like too much complexity

@jbrockmendel
Copy link
Member

+1 on the big picture. Implementation-wise, is the idea to pretty much replace our usage of dateutil.parser?

However, it's read as "1st of December, 13th of January". No warning or error is thrown.

Currently, the only way to ensure consistent parsing is by explicitly passing
``format=``. The argument ``infer_datetime_format``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor but related, but it would be good to mention that infer_datetime_format should be strict with respect to being mixed with format

In [1]: pd.to_datetime(["2022-01-01"], infer_datetime_format=True, format="%Y-%m-%d")
Out[1]: DatetimeIndex(['2022-01-01'], dtype='datetime64[ns]', freq=None)

# Format doesn't match the input
In [2]: pd.to_datetime(["2022-01-01"], infer_datetime_format=True, format="%m-%d-%Y")
Out[2]: DatetimeIndex(['2022-01-01'], dtype='datetime64[ns]', freq=None)

i.e. it's not great that format != None and infer_datetime_format=True

@mroeschke
Copy link
Member

If infer_datetime_format was deprecated, I do like @rhshadrach's suggestion #48621 (comment) of making guess_datetime_format public such that existing "functionality" isn't lost while making to_datetime stricter as the prior behavior would be essentially equivalent to:

to_datetime(arg, format=pd.tools.guess_datetime_format(arg[0]))

@MarcoGorelli
Copy link
Member Author

+1 on the big picture. Implementation-wise, is the idea to pretty much replace our usage of dateutil.parser?

Thanks

dateutil.parser would still be used within guess_datetime_format, but then subsequent rows would be parsed with that guessed format rather than repeatedly calling dateutil.parser and risk having it silently switch format

I'm also on board with the suggestion to make guess_datetime_format public

@MarcoGorelli
Copy link
Member Author

Thanks all for your feedback, I'll incorporated some points into the document

There's been a couple of approvals, so for now I've changed the status to accepted

@MarcoGorelli
Copy link
Member Author

There's been 3 explicit approvals from core members, no "requested changes", and this has been open all week, so merging now

Thanks all for the discussion, aiming to start working on this soon-ish

@MarcoGorelli MarcoGorelli merged commit 7d852a9 into pandas-dev:main Sep 23, 2022
@datapythonista
Copy link
Member

I couldn't review before merged, but happy with the proposal here, really nice improvement.

I'm curious what's the suggested workaround mentioned in the PDEP for parsing columns with different formats. Should we mention it in the PDEP? Or the idea is to write it in the logs when implemented?

Also, seems like few formats are broken when rendered: https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html

@MarcoGorelli
Copy link
Member Author

Thanks - yeah it's mentioned at the end of "detailed description"

If a user has dates in a mixed format, they can still use flexible parsing and accept the risks that poses, e.g.:

In [3]: pd.Series(['12-01-2000 00:00:00', '13-01-2000 00:00:00']).apply(pd.to_datetime)
Out[3]:
0   2000-12-01
1   2000-01-13
dtype: datetime64[ns]

@rhshadrach
Copy link
Member

I'm curious what's the suggested workaround mentioned in the PDEP for parsing columns with different formats.

One way I read this question (and not sure if it's the right way) is if a frame has two columns each with a different (but self-consistent) format. They would be parsed individually, is that correct @MarcoGorelli?

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Sep 24, 2022

Ah apologies, I'd misunderstood

I presume you mean in read_csv? If so, then yes, e.g.:

data = io.StringIO('13-01-2000 00:00,13 Jan 2000\n14-01-2000 00:00,14 Jan 2000\n')
print(pd.read_csv(data, header=None, parse_dates=[0, 1]))

would give

           0          1
0 2000-01-13 2000-01-13
1 2000-01-14 2000-01-14

Each columns were parsed according to the format inferred from its respective first row.

If the first row of the first column had been 12-01-2000 00:00, then the inferred format would've been month-first, and so read_csv would've returned it unaltered as object, whilst the second column would still have been successfully converted

This wouldn't deviate from the current behaviour

@datapythonista
Copy link
Member

Actually what I had in mind was answered in the first answer, I don't think my question was very clear. I missed the workaround when reading the PDEP, sorry.

## Abstract

The suggestion is that:
- ``to_datetime`` becomes strict and uses the same datetime format to parse all elements in its input.
Copy link
Member

@datapythonista datapythonista Sep 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a blank line before the list, that's markdown standard (GitHub comments allows it, but it's not allowed in the markdown spec)


The whatsnew notes read

> In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we don't have style for blockquote in our website. I created #48758 to fix it.

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* pdep-4: initial draft

* note about making guess_datetime_format public, out of scope work, dayfirst/yearfirst

Co-authored-by: MarcoGorelli <>
@datapythonista
Copy link
Member

@MarcoGorelli was PDEP-4 fully implemented? Should we change its state to Implemented?

Also, I see there was a revision. Would it make sense to add the link to the revision PR to Discussion field, next to the original one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PDEP pandas enhancement proposal

Projects

None yet

Development

Successfully merging this pull request may close these issues.