Skip to content

Conversation

@srotondo
Copy link
Contributor

The problem reported by this issue is that success in parsing a date string depends on how errors are reported (coerce vs raise). How the error is reported should not affect whether or not a parsing error occurs. Initially it appears that experiment 1 should return an error (as experiment 3 does) because the second date string does not match the format established by the first string. However, there is a fallback mechanism that is supposed to try parsing again without the effect of the infer_datetime_format flag. So in fact, experiment 3 should succeed in parsing the date as experiment 1 does.

The problem is that the fallback mechanism only works when errors=raise. This bug fix applies the same logic when errors=coerce so that the fallback parsing will take place regardless of how errors are reported.

@datapythonista
Copy link
Member

Thanks for the PR @srotondo.

Do you mind reverting the unrelated changes (removing or adding blank lines) please? Also, we need to add a test that captures the problem, fails with the original code, and passes after your changes. The issue number, better have it as a comment in the test, not in the code.

Also, we need a release note with the change. You can have a look at any merged bug PRs as a reference.

@datapythonista datapythonista added the Error Reporting Incorrect or improved errors from pandas label Jul 21, 2022
@srotondo
Copy link
Contributor Author

srotondo commented Aug 3, 2022

@datapythonista I've added the test and the release note to this code, is there anything else you would like me to do?

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, and thanks for the changes @srotondo. Can you have a look at the comment please?

@scott-r
Copy link

scott-r commented Aug 10, 2022

The fallback behavior is apparent in the code but not mentioned in the documentation for infer_datetime_format.

infer_datetime_format: bool, default False
If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

I think it would be accurate to add one more sentence at the end:

If subsequent datetime strings do not follow the inferred format, parsing will fall back to the slower method of determining the format for each string individually.

That should be sufficient to remove the expectation that an error will be raised without committing to an exact description of the fallback parsing.

@srotondo
Copy link
Contributor Author

@datapythonista I believe I have made the appropriate changes. Is there anything else?

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @srotondo, this makes more sense. I suggest some changes to make code more readable, see what you can do, and I'll review again when it's easier to see every case being tested.

es1 = pd.Series(["1/1/2000", "7/12/1200"])
es2 = pd.Series(["1/1/2000", "Hello"])

if error == "coerce":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes @srotondo. This test is a bit complex and not very easy to read. I think we could make things more readable if we have if for example split this in different tests. Something like:

  • test_to_datetime_infer_datetime_format_with_different_formats
  • test_to_datetime_infer_datetime_format_with_nat
  • test_to_datetime_infer_datetime_format_with_out_of_bounds
  • test_to_datetime_infer_datetime_format_with_invalid_date

What do you think? Then each test would be quite easy to read. Or maybe by parametrizing the input and the two outputs (coerce and raise). This will avoid a bit of unrepeated code and be quite concise, but probably not so readable.

See what makes sense to you, but would be good to make things as simple as possible, as the pandas codebase is already huge and too complex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista thank you for your feedback on the tests. Because the cases are all closely related, I think it will be easier to understand as one test, but I did simplify some of the test and added more comments to clarify parts of the test.

expected2 = pd.Series([pd.Timestamp("2000-01-01 00:00:00"), pd.NaT])

es1 = pd.Series(["1/1/2000", "7/12/1200"])
es2 = pd.Series(["1/1/2000", "Hello"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of Hello use Invalid date or something like that, so it's easier to understand what we're doing here.

@srotondo
Copy link
Contributor Author

@datapythonista I believe that I have made all the changes you have requested, including simplifying the tests. Is there anything else you would like me to do?

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I continue to find this test too hard to read. And very hard to understand what's wrong when it fails. I need to insist on using good testing practices, and restructure in a way that make tests atomic and simple. Like not having if statements, not having more than one expected value (unless using parametrization)...

If you don't find a better, way, using the previously suggested tests is a good start.

@srotondo
Copy link
Contributor Author

srotondo commented Sep 2, 2022

@datapythonista I've split up the test into multiple pieces like you asked.

@datapythonista
Copy link
Member

Thanks @srotondo, this is more readable now.

But looks like you're somehow reimplementing the tests in here, no? Would be good to start by adding the exact case or test that your fix addresses, which is not immediately clear to me. Thanks!

@srotondo
Copy link
Contributor Author

srotondo commented Sep 6, 2022

But looks like you're somehow reimplementing the tests in here, no? Would be good to start by adding the exact case or test that your fix addresses, which is not immediately clear to me. Thanks!

@datapythonista It might look like some of these tests are reimplementations, but there are some key differences in the cases that are being tested. Given the line number you linked in that test, I assume you are mainly referring to these 3 tests in that file: test_datetime_invalid_scalar, test_datetime_outofbounds_scalar, and test_datetime_invalid_index. While they do share a few similarities, there are some important key differences between these tests and my added tests.

To start with, test_datetime_invalid_scalar and test_datetime_outofbounds_scalar only deal in scalar values, unlike my tests, which deal with lists. The difference from test_datetime_invalid_index lies largely in the data being tested. The list used in test_datetime_invalid_index is entirely made up of values that will not parse, while the lists in my tests always start with a valid date. This is important because infer_datetime_format only becomes relevant on the value after the first valid date, since it would need a format already to infer from. Therefore this test doesn't really check the fallback code as my tests are meant to.

@datapythonista
Copy link
Member

Thanks @srotondo that makes sense. Is there a reason why the tests you added are not next to the ones I mentioned? Seems like they should be next to each other. Or, not sure if it could make sense to add an extra assert in those lines to check the same cases when the invalid values are in a Series following a valid value.

And it's still not clear to me what's the exact case that you're fixing. Or are all the tests that you added failing without the change in the code function?

@srotondo
Copy link
Contributor Author

srotondo commented Sep 6, 2022

@datapythonista The true failing case is that when 2 valid dates with different formats were passed in using the errors = "coerce" option and infer_datetime_format = True, to_datetime would return NaT for the second value, when it should correctly parse the second value. The second and third tests I wrote ensure that coerce still returns NaT when it should return NaT and that the cases where errors = "raise" still raises exceptions when it should, respectively.

The choice of file for this test is mostly arbitrary, and I can move it if you'd like. However, could you please tell me if there is anything else you would like me to change as well so that I can do all of it at the same time and we can get this bugfix merged.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be accurate to add one more sentence at the end:

If subsequent datetime strings do not follow the inferred format, parsing will fall back to the slower method of determining the format for each string individually.

Hey - first, sorry to block this, and thanks for your work here

Just wanted to ask whether this is really what would be best - why not make infer_datetime_format strict:

  • if the format is inferred, and some row doesn't conform to it, it just errors or coerces, rather than falling back to the slower path

This would involve removing the special-casing here

if not infer_datetime_format:
if errors == "raise":
raise
elif errors == "coerce":
result = np.empty(arg.shape, dtype="M8[ns]")
iresult = result.view("i8")
iresult.fill(iNaT)
else:
result = arg
else:
# Indicates to the caller to fallback to objects_to_datetime64ns
return None

and just have

        if errors == "raise":
            raise
        elif errors == "coerce":
            result = np.empty(arg.shape, dtype="M8[ns]")
            iresult = result.view("i8")
            iresult.fill(iNaT)
        else:
            result = arg

This would also solve the linked issue, and rather than adding a new if-then statement, it'd remove one

@srotondo
Copy link
Contributor Author

Just wanted to ask whether this is really what would be best - why not make infer_datetime_format strict:

  • if the format is inferred, and some row doesn't conform to it, it just errors or coerces, rather than falling back to the slower path

@MarcoGorelli There exists large amounts of code with the express purpose of causing the operation to fall back if the formats aren't the same. The original documentation seems to strongly suggest that infer_datetime_format exists to cause parsing to speed up whenever possible, rather than causing otherwise valid dates to fail to parse. Also, given how many different ways a date string can be written, forcing all inputs to to_datetime to be the same format doesn't seem like the best course of action to me. I think that making this operation strict would go against the intended functionality.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Sep 16, 2022

Currently, the docs don't specify what happens if the format can be inferred, but subsequent rows don't match that format

And users don't expect the format to change midway, especially when using this option, e.g. as in #46210

cc @mroeschke @jreback @jorisvandenbossche if you have comments

TL;DR: I'm suggesting that when using infer_datetime_format=True, the format detected from the first non-NaN value should be used to parse the rest of the Series, exactly as if the user had passed it to format=

Example
Concretely, I think the following two should behave the same:

pd.to_datetime(['01-31-2000', '20-01-2000'], infer_datetime_format=True, errors='raise')
pd.to_datetime(['01-31-2000', '20-01-2000'], format='%m-%d-%Y', errors='raise')

Currently, the latter raises (as expected) whereas the former swaps format midway

@srotondo
Copy link
Contributor Author

@MarcoGorelli Even though it's not very well documented, looking at the code and following the pathway makes it very clear that the fallback is intentional. If you would like to change the behavior, I think that should be a separate PR. This code change deals with the fact that raise and coerce were behaving inconsistently with infer_datetime_format, which has been fixed. I think that removing the fallback behavior altogether is a much larger and separate issue.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - just had a closer look, and indeed the user guide does explicitly note:

pandas will fallback to the usual parsing if either the format cannot be guessed or the format that was guessed cannot properly parse the entire column of strings

That's fine then, I'll open a separate issue about changing that

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix itself looks good to me, thanks! I just have some comments about the tests:

  1. is the location right? Wouldn't /pandas/tests/tools/test_to_datetime.py be better?
  2. couldn't we just cover all these new tests by parametrising over errors in /pandas/tests/tools/test_to_datetime.py::TestToDatetimeInferFormat::test_to_datetime_infer_datetime_format_inconsistent_format ?

Apologies for how long reviews are taking, I'll do my best to help push this across the line now

"error",
["coerce", "raise"],
)
def test_fallback_different_formats(self, error):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test duplicative of

@pytest.mark.parametrize(
"data",
[
["01/01/2011 00:00:00", "01-02-2011 00:00:00", "2011-01-03T00:00:00"],
["Jan/01/2011", "Feb/01/2011", "Mar/01/2011"],
],
)
def test_to_datetime_infer_datetime_format_inconsistent_format(self, cache, data):
ser = Series(np.array(data))
# When the format is inconsistent, infer_datetime_format should just
# fallback to the default parsing
tm.assert_series_equal(
to_datetime(ser, infer_datetime_format=False, cache=cache),
to_datetime(ser, infer_datetime_format=True, cache=cache),
)

?

If so, why not just add

    @pytest.mark.parametrize('errors', ['coerce', 'raise'])

to that test?

"dateseries",
[
pd.Series(["1/1/2000", "7/12/1200"]),
pd.Series(["1/1/2000", "Invalid input"]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this duplicative of

@pytest.mark.parametrize("values", [["a"], ["00:01:99"], ["a", "b", "99:00:00"]])
@pytest.mark.parametrize("infer", [True, False])
@pytest.mark.parametrize("format", [None, "H%:M%:S%"])
def test_datetime_invalid_index(self, values, format, infer):
# GH24763
res = to_datetime(
values, errors="ignore", format=format, infer_datetime_format=infer
)
tm.assert_index_equal(res, Index(values))
res = to_datetime(
values, errors="coerce", format=format, infer_datetime_format=infer
)
tm.assert_index_equal(res, DatetimeIndex([NaT] * len(values)))
msg = (
"is a bad directive in format|"
f"Given date string {values[0]} not likely a datetime|"
"second must be in 0..59"
)
with pytest.raises(ValueError, match=msg):
to_datetime(
values, errors="raise", format=format, infer_datetime_format=infer
)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's a lot of comments there, so I'm going to try to answer all these questions here in this response here. Please tell me if it's not clear which comment I'm talking about at any given point here.

I was asked about these last 4 comments before, so I'll link my previous response here so you can see it in case you missed it.

It might look like some of these tests are reimplementations, but there are some key differences in the cases that are being tested. Given the line number you linked in that test, I assume you are mainly referring to these 3 tests in that file: test_datetime_invalid_scalar, test_datetime_outofbounds_scalar, and test_datetime_invalid_index. While they do share a few similarities, there are some important key differences between these tests and my added tests.

To start with, test_datetime_invalid_scalar and test_datetime_outofbounds_scalar only deal in scalar values, unlike my tests, which deal with lists. The difference from test_datetime_invalid_index lies largely in the data being tested. The list used in test_datetime_invalid_index is entirely made up of values that will not parse, while the lists in my tests always start with a valid date. This is important because infer_datetime_format only becomes relevant on the value after the first valid date, since it would need a format already to infer from. Therefore this test doesn't really check the fallback code as my tests are meant to.

In terms of your comments about the code changes and the first test comment, I agree with those changes and I can do that for sure. Also, in terms of test location, I'm fully open to changing that, since the choice I made in location was largely arbitrary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining and for your attention to detail, that's very clear! Apologies for having missed that, I'd looked through in-line comments but had missed this one in the main thread

For test location - does pandas/tests/arrays/test_datetimes.py::TestToDatetimeInferFormat look good to you?

Copy link
Member

@MarcoGorelli MarcoGorelli Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srotondo have you had a look at #48621?

First, because you've had a close look at this part of the code so I'd really value your input

Second, because if we did go ahead with that, then it'd make the work in this PR unnecessary

Copy link
Contributor Author

@srotondo srotondo Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For test location - does pandas/tests/arrays/test_datetimes.py::TestToDatetimeInferFormat look good to you?

To clarify, is this a new section in test_datetimes.py that you are suggesting, because as far as I can tell, this doesn't actually exist yet. Also, if we're going to keep the tests in this file, I guess it doesn't make much sense to change that test you mentioned at the very start since it's in a different file.

Copy link
Member

@MarcoGorelli MarcoGorelli Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to add new tests to the class which

@pytest.mark.parametrize(
"data",
[
["01/01/2011 00:00:00", "01-02-2011 00:00:00", "2011-01-03T00:00:00"],
["Jan/01/2011", "Feb/01/2011", "Mar/01/2011"],
],
)
def test_to_datetime_infer_datetime_format_inconsistent_format(self, cache, data):
ser = Series(np.array(data))
# When the format is inconsistent, infer_datetime_format should just
# fallback to the default parsing
tm.assert_series_equal(
to_datetime(ser, infer_datetime_format=False, cache=cache),
to_datetime(ser, infer_datetime_format=True, cache=cache),
)

belongs to (TestToDatetimeInferFormat)

Co-authored-by: Marco Edward Gorelli <[email protected]>
@MarcoGorelli
Copy link
Member

PDEP4 was accepted, and it looks like the implementation is almost in

Let's close this then

Thanks @srotondo for your PR - you've done great work, and I'm sorry to be closing it, I just think a different direction is a better direction for the future of pandas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Error Reporting Incorrect or improved errors from pandas

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: to_datetime does not raise when errors='raise'.

4 participants