fix: avoid unnecessary API call in QueryJob.result() when job is already finished #1900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CC @chalmerlowe
    self.assertEqual(result.location, "asia-northeast1")
    self.assertEqual(result.query_id, "xyz-abc")

    def test_result_invokes_begins(self):
As of #967 released in google-cloud-bigquery 3.0.0, the _begin method is no longer used for query jobs.
            timeout=transport_timeout,
        )

    def _done_or_raise(self, retry=DEFAULT_RETRY, timeout=None):
This was overridden because we wanted result() from the superclass to call jobs.getQueryResults, not just jobs.get (i.e. job.reload() in Python). Now that we aren't using the superclass for result(), this method is no longer necessary.
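For readers following the thread, a rough sketch of the kind of override being discussed (approximate and hedged, not the exact library code):

```python
# Approximate sketch, not the exact library code: the override made the
# polling path call jobs.getQueryResults instead of a plain jobs.get.
def _done_or_raise(self, retry=DEFAULT_RETRY, timeout=None):
    try:
        # jobs.getQueryResults blocks server-side until the query finishes
        # (or the request times out), so it doubles as the "wait" step.
        self._reload_query_results(retry=retry, timeout=timeout)
    except exceptions.GoogleAPIError as exc:
        self.set_exception(exc)
        return
    # Only do the full jobs.get once the query itself reports completion.
    if self._query_results.complete:
        try:
            self.reload(retry=retry, timeout=timeout)
        except exceptions.GoogleAPIError as exc:
            self.set_exception(exc)
```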
    try:
        self.reload(retry=retry, timeout=transport_timeout)
    except exceptions.GoogleAPIError as exc:
        self.set_exception(exc)
Thought: We probably should have been calling set_exception based on the job status. Need to look into this further.
OK. We are. 😅
    self.set_exception(exception)

Which we call from _set_properties:

    self._set_future_result()

Which we call from reload:

    self._set_properties(api_response)
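Put together, the chain quoted above looks roughly like this (simplified sketch, not the exact library code):

```python
# Simplified sketch of the chain quoted above, not the exact library code.
def reload(self, retry=..., timeout=...):
    api_response = ...  # response from jobs.get
    self._set_properties(api_response)

def _set_properties(self, api_response):
    self._properties = api_response
    self._set_future_result()

def _set_future_result(self):
    # Only resolve the future once the job is actually DONE.
    if self.state != "DONE":
        return
    if self.error_result is not None:
        # Translate the job's error status into an exception on the future.
        exception = _error_result_to_exception(self.error_result)
        self.set_exception(exception)
    else:
        self.set_result(self)
```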
    # wait for the query to finish. Unlike most methods,
    # jobs.getQueryResults hangs as long as it can to ensure we
    # know when the query has finished as soon as possible.
    self._reload_query_results(retry=retry, timeout=timeout)
Uh oh: if jobs.getQueryResults fails because the job failed, it can throw an exception, but restart_query_job will still be False.
But we don't want restart_query_job = True, because this call can sometimes raise an ambiguous exception such as "quota exceeded", where we don't know whether it's the job-level quota (in which case the job failed) or a quota at a higher level (the Google Frontend, GFE), in which case the job might actually still be running and/or have succeeded.
This isn't the worst way to fail, but it would be nice to make the jobs.get call above when an exception occurs, so we get a chance at retrying this job if it actually failed.
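A hypothetical sketch of that suggestion (not what this PR implements verbatim): on an exception from jobs.getQueryResults, fall back to jobs.get to check the real job state before deciding whether a restart is warranted.

```python
# Hypothetical sketch of the suggestion above, not this PR's actual code.
try:
    self._reload_query_results(retry=retry, timeout=timeout)
except exceptions.GoogleAPIError:
    # jobs.getQueryResults failed ambiguously; ask jobs.get what really happened.
    self.reload(retry=retry, timeout=timeout)
    if self.done() and self.error_result is not None:
        # The job itself failed, so restarting it via job_retry is safe.
        restart_query_job = True
    # Otherwise the error may have come from a higher layer (e.g. GFE quota)
    # while the job is still running or has already succeeded: do not restart.
```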
chalmerlowe left a comment
I am halfway through my review.
Releasing these comments for now.
Will come back to this to finish out my review as soon as possible.
google/cloud/bigquery/job/query.py (outdated)
    is_job_done = job_retry(is_job_done)

    do_get_result()
    # timeout can be `None` or an object from our superclass
Which superclass are we discussing here?
google.api_core.future.polling.PollingFuture._DEFAULT_VALUE, introduced in googleapis/python-api-core#462.
I've updated the comments with some more info, as well as some things to consider in case we want to have a default value for timeout in the future.
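For illustration, the sentinel-default pattern being referenced looks roughly like this (simplified; see google.api_core.future.polling for the real definition):

```python
# Simplified illustration of the sentinel-default pattern, not the exact
# google-api-core source.
class PollingFuture:
    # Sentinel meaning "the caller did not pass a timeout at all", which is
    # distinct from an explicit timeout=None ("wait forever").
    _DEFAULT_VALUE = object()

    def result(self, timeout=_DEFAULT_VALUE, retry=None, polling=None):
        if timeout is not PollingFuture._DEFAULT_VALUE:
            # The caller chose a timeout (possibly None); honor it.
            ...
        else:
            # No timeout given; fall back to the polling policy's default.
            ...
```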
google/cloud/bigquery/retry.py (outdated)
    # rateLimitExceeded errors, which can be raised either by the Google load
    # balancer or the BigQuery job server.
    - _DEFAULT_JOB_DEADLINE = 3.0 * _DEFAULT_RETRY_DEADLINE
    + _DEFAULT_JOB_DEADLINE = 4.0 * _DEFAULT_RETRY_DEADLINE
What is the purpose of using 4.0 here?
Can we get a comment indicating why 4.0?
Updated to 2.0 * (2.0 * _DEFAULT_RETRY_DEADLINE) and added some explanation, both here and in QueryJob.result().
Note: this still only gets us one query retry in the face of the problematic ambiguous error codes from jobs.getQueryResults(), but that's better than nothing, which is what we were actually getting before in some cases. I don't feel comfortable bumping this much further, though maybe 3.0 * 2.0 * _DEFAULT_RETRY_DEADLINE would be slightly less arbitrary at 1 hour?
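For scale, a back-of-the-envelope on those numbers (assuming _DEFAULT_RETRY_DEADLINE is 10 minutes, which matches the "1 hour" figure mentioned above):

```python
# Back-of-the-envelope only; assumes _DEFAULT_RETRY_DEADLINE is 10 minutes.
_DEFAULT_RETRY_DEADLINE = 10.0 * 60.0                           # 600 seconds
_DEFAULT_JOB_DEADLINE = 2.0 * (2.0 * _DEFAULT_RETRY_DEADLINE)   # 2,400 s = 40 min
# The "slightly less arbitrary" alternative floated above:
# 3.0 * (2.0 * _DEFAULT_RETRY_DEADLINE) == 3,600 s == 1 hour
```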
    }
    conn = make_connection(
    -     query_resource, query_resource_done, job_resource_done, query_page_resource
    +     job_resource,
Not sure I am tracking the relationship between the make_connection inputs versus the assert_has_calls checks.
Can you explain how these tests are supposed to work?
make_connection is a convention in the google-cloud-bigquery unit tests that actually predates our use of the "mock" package. It mocks out the responses to REST API calls, previously with a fake implementation of our Connection class from the _http module and now with a true mock object. For every request that our test makes, there should be a corresponding response. As with Mock.side_effect, any exceptions in this list are raised instead of returned.
I'm guessing your question also relates to "Why this particular set of requests/responses?". I've added some comments explaining why we're expecting this sequence of API calls. I've also updated this test to more explicitly check for a possible cause of customer issue b/332850329.
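For anyone unfamiliar with the helper, a rough sketch of the pattern (approximate, not the verbatim test helper):

```python
from unittest import mock

def make_connection(*responses):
    """Rough sketch of the test helper described above (not verbatim).

    Each positional argument is the fake response to one REST API request,
    in order. Exception instances in the list are raised instead of
    returned, exactly like Mock.side_effect.
    """
    conn = mock.Mock(spec=["api_request"])
    conn.api_request.side_effect = list(responses)
    return conn

# Usage: the first api_request() call returns query_resource, the second
# returns query_resource_done, and so on; an exception object in that
# position makes that particular call raise.
```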
    connection.api_request.assert_has_calls(
    -     [query_results_call, query_results_call, reload_call]
    +     [
    +         reload_call,
Same thing here. Can I get some clarity on what we are doing and looking for?
Added some explanation here as well as above in the make_connection() call.
Co-authored-by: Chalmer Lowe <[email protected]>
chalmerlowe left a comment
LGTM
tests/unit/job/test_query.py (outdated)
    - job.result()
    + with freezegun.freeze_time("1970-01-01 00:00:00", tick=False):
    +     job.result(timeout=1.125)
Is there a reason we are using such a specific number, 1.125?
Can I get a comment here to let future me know why we picked this number?
| method="GET", | ||
| path=f"/projects/{self.PROJECT}/queries/{self.JOB_ID}", | ||
| query_params={ | ||
| "maxResults": 0, |
Is maxResults of 0 synonymous with asking for all results? Or is it really asking for zero results?
We actually want 0 rows. If we omit this or ask for a non-zero number of rows, the jobs.getQueryResults API can hang when the query has wide rows (many columns).
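For illustration, the request being discussed looks roughly like this (parameter names follow the jobs.getQueryResults REST API; the surrounding call is a sketch, not the library's internal code):

```python
# Sketch only: poll jobs.getQueryResults for completion, asking for zero rows
# so the response stays small even when the schema has many columns.
# `client`, `project`, `job_id`, and `location` are placeholders.
response = client._connection.api_request(
    method="GET",
    path=f"/projects/{project}/queries/{job_id}",
    query_params={
        "maxResults": 0,      # we only care about jobComplete, not row data
        "location": location,
        "timeoutMs": 10_000,  # how long the server may hold the request open
    },
)
job_complete = response.get("jobComplete", False)
```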
Co-authored-by: Chalmer Lowe <[email protected]>
Thanks for all the comments, etc.
Future me thanks you as well.
LGTM, APPROVED.
BEGIN_COMMIT_OVERRIDE
perf: avoid unnecessary API call in `QueryJob.result()` when job is already finished (#1900)
fix: retry query jobs that fail even with ambiguous `jobs.getQueryResults` REST errors (#1903, #1900)
END_COMMIT_OVERRIDE
`query_and_wait` for lower-latency small queries (python-bigquery-magics#15), since this loop for waiting for the query to finish will make it easier to add a progress bar (if we decide that's needed)