version queries and add clients_last_seen and mau28_by_dimensions #1
Changes from all commits
@@ -0,0 +1,17 @@

```sql
SELECT
  @submission_date AS submission_date,
  CURRENT_DATETIME() AS generated_time,
  MAX(submission_date_s3) AS last_seen_date,
  -- approximate LAST_VALUE(input).*
  ARRAY_AGG(input
    ORDER BY submission_date_s3
    DESC LIMIT 1
  )[OFFSET(0)].* EXCEPT (submission_date_s3)
FROM
  clients_daily_v6 AS input
WHERE
  submission_date_s3 <= @submission_date
  AND
  submission_date_s3 > DATE_SUB(@submission_date, INTERVAL 28 DAY)
GROUP BY
  input.client_id
```
Contributor

This is fascinating. I like this better than having to use a ROW_NUMBER window function and then select only the first row per client.
Collaborator (Author)

we could alternately
Collaborator (Author)

I decided to check, and it's not as simple as above, but using a window is so much faster it hurts (it runs in ~1/6th of the time and uses ~1/8th of the compute).
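The thread above compares the `ARRAY_AGG ... LIMIT 1` pattern against a ROW_NUMBER window function but never shows the window version. As a hedged sketch only (not code from this PR; the subquery shape and the `_n` alias are assumptions for illustration), it could look like:

```sql
-- Sketch only: take each client's most recent clients_daily row with a
-- window function instead of ARRAY_AGG ... LIMIT 1.
SELECT
  @submission_date AS submission_date,
  CURRENT_DATETIME() AS generated_time,
  submission_date_s3 AS last_seen_date,
  * EXCEPT (submission_date_s3, _n)
FROM (
  SELECT
    *,
    -- _n = 1 marks the latest row per client in the 28-day window
    ROW_NUMBER() OVER (
      PARTITION BY client_id
      ORDER BY submission_date_s3 DESC
    ) AS _n
  FROM
    clients_daily_v6
  WHERE
    submission_date_s3 <= @submission_date
    AND submission_date_s3 > DATE_SUB(@submission_date, INTERVAL 28 DAY)
) AS ranked
WHERE
  _n = 1
```

Per the benchmark reported in the thread, the window form ran in roughly a sixth of the time and an eighth of the compute of the `ARRAY_AGG` form.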
@@ -0,0 +1,30 @@

```sql
WITH current_sample AS (
  SELECT
    submission_date_s3 AS last_seen_date,
    * EXCEPT (submission_date_s3)
  FROM
    clients_daily_v6
  WHERE
    submission_date_s3 = @submission_date
), previous AS (
  SELECT
    * EXCEPT (submission_date,
      generated_time)
  FROM
    analysis.clients_last_seen_v1
  WHERE
    submission_date = DATE_SUB(@submission_date, INTERVAL 1 DAY)
    AND last_seen_date > DATE_SUB(@submission_date, INTERVAL 28 DAY)
)
SELECT
  @submission_date AS submission_date,
  CURRENT_DATETIME() AS generated_time,
  IF(current_sample.client_id IS NOT NULL,
    current_sample,
    previous).*
FROM
  current_sample
FULL JOIN
  previous
USING
  (client_id)
```
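Not part of the diff, but it may help to spell out what this incremental shape buys downstream: because every retained row carries a `last_seen_date` within the trailing 28 days, any n-day active count reduces to a filter over a single day's partition. A hedged illustration (the 7-day WAU definition here is an assumption, not something this PR defines):

```sql
-- Illustration only: derive DAU/WAU/MAU for one day from clients_last_seen.
SELECT
  submission_date,
  COUNTIF(last_seen_date = submission_date) AS dau,
  COUNTIF(DATE_DIFF(submission_date, last_seen_date, DAY) < 7) AS wau,
  COUNT(*) AS mau  -- every retained row was seen in the past 28 days
FROM
  analysis.clients_last_seen_v1
WHERE
  submission_date = @submission_date
GROUP BY
  submission_date
```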
@@ -0,0 +1,24 @@

```sql
SELECT
  submission_date,
  CURRENT_DATETIME() AS generated_time,
  COUNT(*) AS mau,
  COUNTIF(last_seen_date = submission_date) AS dau,
  -- requested fields from bug 1525689
  source,
  medium,
  campaign,
  content,
  country,
  distribution_id
FROM
  clients_last_seen_v1
WHERE
  submission_date = @submission_date
GROUP BY
  submission_date,
  source,
  medium,
  campaign,
  content,
  country,
  distribution_id
```
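Because each client falls into exactly one combination of the GROUP BY dimensions, the per-dimension counts partition the population, so overall totals can be recovered by summation. A usage sketch (the destination table name `mau28_by_dimensions_v1` is a guess based on the PR title, not confirmed by the diff):

```sql
-- Illustration only: roll the per-dimension counts back up to totals.
SELECT
  submission_date,
  SUM(dau) AS dau,
  SUM(mau) AS mau
FROM
  mau28_by_dimensions_v1  -- hypothetical name for this query's destination
WHERE
  submission_date = @submission_date
GROUP BY
  submission_date
```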
Do we know at this point what the hierarchy of projects, datasets, and tables is going to look like? Will these derived tables live in the same project and dataset as the source data?

With GCP ingestion so far, we're splitting tables into different datasets based on document namespace. We would need to change that practice to meet this requirement.

There are implications for permissions, testing, etc. that I haven't fully thought through yet.
i don't know, and dataset per document namespace seems good to me. this has lots of implications, but if we can avoid depending on a static dataset name then we only need unique datasets per test instead of unique projects in order to run tests in parallel.

i think this is fine for queries that only read one input table (hence *should*, not *must*), because the output dataset can be specified separately from the default dataset. For queries that need to read multiple tables from multiple datasets, I think for now we can just assume they're either run in series or require multiple projects. The first time we need that, we can consider solutions like templating dataset names for testing and adding a recommendation here to follow the chosen solution.
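To make "avoid depending on a static dataset name" concrete, here is a hedged sketch (not from this PR): a single-input query can leave its table reference unqualified and let the job's default dataset resolve it, while the destination table and dataset are supplied as job configuration rather than in the SQL.

```sql
-- Sketch only: no dataset qualifier on the input table, so a test harness
-- can point the job's default dataset at a per-test dataset. The destination
-- table is likewise set on the query job, not in the query text.
SELECT
  COUNT(*) AS row_count
FROM
  clients_last_seen_v1  -- resolved against the job's default dataset
WHERE
  submission_date = @submission_date
```

By contrast, the incremental query above reads `analysis.clients_last_seen_v1` with a static dataset qualifier alongside `clients_daily_v6`, which appears to be exactly the multiple-dataset case this comment defers.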