fix: avoid unnecessary API call in QueryJob.result() when job is already finished #1900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CC @chalmerlowe
    self.assertEqual(result.location, "asia-northeast1")
    self.assertEqual(result.query_id, "xyz-abc")

    def test_result_invokes_begins(self):
As of #967 released in google-cloud-bigquery 3.0.0, the _begin method is no longer used for query jobs.
            timeout=transport_timeout,
        )

    def _done_or_raise(self, retry=DEFAULT_RETRY, timeout=None):
This was overridden because we wanted result() from the superclass to call jobs.getQueryResults, not just jobs.get (i.e. job.reload() in Python). Now that we aren't using the superclass for result(), this method is no longer necessary.
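For readers following the thread, a rough sketch of the kind of override being discussed (approximate and hedged, not the exact library code):

```python
# Approximate sketch, not the exact library code: the override made the
# polling path call jobs.getQueryResults instead of a plain jobs.get.
def _done_or_raise(self, retry=DEFAULT_RETRY, timeout=None):
    try:
        # jobs.getQueryResults blocks server-side until the query finishes
        # (or the request times out), so it doubles as the "wait" step.
        self._reload_query_results(retry=retry, timeout=timeout)
    except exceptions.GoogleAPIError as exc:
        self.set_exception(exc)
        return
    # Only do the full jobs.get once the query itself reports completion.
    if self._query_results.complete:
        try:
            self.reload(retry=retry, timeout=timeout)
        except exceptions.GoogleAPIError as exc:
            self.set_exception(exc)
```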
    try:
        self.reload(retry=retry, timeout=transport_timeout)
    except exceptions.GoogleAPIError as exc:
        self.set_exception(exc)
Thought: We probably should have been calling set_exception based on the job status. Need to look into this further.
OK. We are. 😅
    self.set_exception(exception)

Which we call from _set_properties:

    self._set_future_result()

Which we call from reload:

    self._set_properties(api_response)
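Put together, the chain quoted above looks roughly like this (simplified sketch, not the exact library code):

```python
# Simplified sketch of the chain quoted above, not the exact library code.
def reload(self, retry=..., timeout=...):
    api_response = ...  # response from jobs.get
    self._set_properties(api_response)

def _set_properties(self, api_response):
    self._properties = api_response
    self._set_future_result()

def _set_future_result(self):
    # Only resolve the future once the job is actually DONE.
    if self.state != "DONE":
        return
    if self.error_result is not None:
        # Translate the job's error status into an exception on the future.
        exception = _error_result_to_exception(self.error_result)
        self.set_exception(exception)
    else:
        self.set_result(self)
```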
    # wait for the query to finish. Unlike most methods,
    # jobs.getQueryResults hangs as long as it can to ensure we
    # know when the query has finished as soon as possible.
    self._reload_query_results(retry=retry, timeout=timeout)
Uh oh: if jobs.getQueryResults fails because the job failed, it can throw an exception, but restart_query_job will still be False.
But we don't want restart_query_job = True, because this call can sometimes raise an ambiguous exception such as "quota exceeded", where we don't know whether it's the job-level quota (in which case the job failed) or a quota at a higher level (the Google Frontend, GFE), in which case the job might actually still be running and/or have succeeded.
This isn't the worst way to fail, but it would be nice to make the jobs.get call above when an exception occurs, so we get a chance at retrying this job if it actually failed.
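A hypothetical sketch of that suggestion (not what this PR implements verbatim): on an exception from jobs.getQueryResults, fall back to jobs.get to check the real job state before deciding whether a restart is warranted.

```python
# Hypothetical sketch of the suggestion above, not this PR's actual code.
try:
    self._reload_query_results(retry=retry, timeout=timeout)
except exceptions.GoogleAPIError:
    # jobs.getQueryResults failed ambiguously; ask jobs.get what really happened.
    self.reload(retry=retry, timeout=timeout)
    if self.done() and self.error_result is not None:
        # The job itself failed, so restarting it via job_retry is safe.
        restart_query_job = True
    # Otherwise the error may have come from a higher layer (e.g. GFE quota)
    # while the job is still running or has already succeeded: do not restart.
```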
chalmerlowe left a comment
I am halfway through my review.
Releasing these comments for now.
Will come back to this to finish out my review as soon as possible.
google/cloud/bigquery/job/query.py (outdated)
    is_job_done = job_retry(is_job_done)

    do_get_result()
    # timeout can be `None` or an object from our superclass
Which superclass are we discussing here?
google.api_core.future.polling.PollingFuture._DEFAULT_VALUE, introduced in googleapis/python-api-core#462.
I've updated the comments with some more info, as well as some things to consider in case we want to have a default value for timeout in the future.
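For illustration, the sentinel-default pattern being referenced looks roughly like this (simplified; see google.api_core.future.polling for the real definition):

```python
# Simplified illustration of the sentinel-default pattern, not the exact
# google-api-core source.
class PollingFuture:
    # Sentinel meaning "the caller did not pass a timeout at all", which is
    # distinct from an explicit timeout=None ("wait forever").
    _DEFAULT_VALUE = object()

    def result(self, timeout=_DEFAULT_VALUE, retry=None, polling=None):
        if timeout is not PollingFuture._DEFAULT_VALUE:
            # The caller chose a timeout (possibly None); honor it.
            ...
        else:
            # No timeout given; fall back to the polling policy's default.
            ...
```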
google/cloud/bigquery/retry.py (outdated)
    # rateLimitExceeded errors, which can be raised either by the Google load
    # balancer or the BigQuery job server.
    - _DEFAULT_JOB_DEADLINE = 3.0 * _DEFAULT_RETRY_DEADLINE
    + _DEFAULT_JOB_DEADLINE = 4.0 * _DEFAULT_RETRY_DEADLINE
What is the purpose of using 4.0 here?
Can we get a comment indicating why 4.0?
Updated to 2.0 * (2.0 * _DEFAULT_RETRY_DEADLINE) and added some explanation, both here and in QueryJob.result().
Note: this still only gets us one query retry in the face of the problematic ambiguous error codes from jobs.getQueryResults(), but that's better than nothing, which is what we were actually getting before in some cases. I don't feel comfortable bumping this much further, though maybe 3.0 * 2.0 * _DEFAULT_RETRY_DEADLINE would be slightly less arbitrary at 1 hour?
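For scale, a back-of-the-envelope on those numbers (assuming _DEFAULT_RETRY_DEADLINE is 10 minutes, which matches the "1 hour" figure mentioned above):

```python
# Back-of-the-envelope only; assumes _DEFAULT_RETRY_DEADLINE is 10 minutes.
_DEFAULT_RETRY_DEADLINE = 10.0 * 60.0                           # 600 seconds
_DEFAULT_JOB_DEADLINE = 2.0 * (2.0 * _DEFAULT_RETRY_DEADLINE)   # 2,400 s = 40 min
# The "slightly less arbitrary" alternative floated above:
# 3.0 * (2.0 * _DEFAULT_RETRY_DEADLINE) == 3,600 s == 1 hour
```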
    }
    conn = make_connection(
    -     query_resource, query_resource_done, job_resource_done, query_page_resource
    +     job_resource,
Not sure I am tracking the relationship between the make_connection inputs versus the assert_has_calls checks.
Can you explain how these tests are supposed to work?
make_connection is a convention in the google-cloud-bigquery unit tests that actually predates our use of the "mock" package. It mocks out the responses to REST API calls, previously with a fake implementation of our Connection class from the _http module and now with a true mock object. For every request that our test makes, there should be a corresponding response. As with Mock.side_effect, any exceptions in this list are raised instead of returned.
I'm guessing your question also relates to "Why this particular set of requests/responses?". I've added some comments explaining why we're expecting this sequence of API calls. I've also updated this test to more explicitly check for a possible cause of customer issue b/332850329.
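For anyone unfamiliar with the helper, a rough sketch of the pattern (approximate, not the verbatim test helper):

```python
from unittest import mock

def make_connection(*responses):
    """Rough sketch of the test helper described above (not verbatim).

    Each positional argument is the fake response to one REST API request,
    in order. Exception instances in the list are raised instead of
    returned, exactly like Mock.side_effect.
    """
    conn = mock.Mock(spec=["api_request"])
    conn.api_request.side_effect = list(responses)
    return conn

# Usage: the first api_request() call returns query_resource, the second
# returns query_resource_done, and so on; an exception object in that
# position makes that particular call raise.
```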
    connection.api_request.assert_has_calls(
    -     [query_results_call, query_results_call, reload_call]
    +     [
    +         reload_call,
Same thing here. Can I get some clarity on what we are doing and looking for?
Added some explanation here as well as above in the make_connection() call.
Co-authored-by: Chalmer Lowe <[email protected]>
chalmerlowe left a comment
LGTM
tests/unit/job/test_query.py (outdated)
    - job.result()
    + with freezegun.freeze_time("1970-01-01 00:00:00", tick=False):
    +     job.result(timeout=1.125)
Is there a reason we are using such a specific number, 1.125?
Can I get a comment here to let future me know why we picked this number?
| method="GET", | ||
| path=f"/projects/{self.PROJECT}/queries/{self.JOB_ID}", | ||
| query_params={ | ||
| "maxResults": 0, |
Is maxResults of 0 synonymous with asking for all results? Or is it really asking for zero results?
We actually want 0 rows. If we omit this or ask for a non-zero number of rows, the jobs.getQueryResults API can hang when the query has wide rows (many columns).
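For illustration, the request being discussed looks roughly like this (parameter names follow the jobs.getQueryResults REST API; the surrounding call is a sketch, not the library's internal code):

```python
# Sketch only: poll jobs.getQueryResults for completion, asking for zero rows
# so the response stays small even when the schema has many columns.
# `client`, `project`, `job_id`, and `location` are placeholders.
response = client._connection.api_request(
    method="GET",
    path=f"/projects/{project}/queries/{job_id}",
    query_params={
        "maxResults": 0,      # we only care about jobComplete, not row data
        "location": location,
        "timeoutMs": 10_000,  # how long the server may hold the request open
    },
)
job_complete = response.get("jobComplete", False)
```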
Co-authored-by: Chalmer Lowe <[email protected]>
Thanks for all the comments, etc.
Future me thanks you as well.
LGTM, APPROVED.
BEGIN_COMMIT_OVERRIDE
perf: avoid unnecessary API call in `QueryJob.result()` when job is already finished (#1900)
fix: retry query jobs that fail even with ambiguous `jobs.getQueryResults` REST errors (#1903, #1900)
END_COMMIT_OVERRIDE
`query_and_wait` for lower-latency small queries (python-bigquery-magics#15), since this loop for waiting for the query to finish will make it easier to add a progress bar (if we decide that's needed)