36 changes: 36 additions & 0 deletions README.md
@@ -2,3 +2,39 @@ BigQuery ETL
===

BigQuery UDFs and SQL queries for building derived datasets.

Recommended practices
===

- Should name SQL files like `sql/destination_table_with_version.sql`, e.g.
  `sql/clients_daily_v6.sql`
- Should not specify a project or dataset in table names to simplify testing
Contributor:

Do we know at this point what the hierarchy of projects, datasets, and tables is going to look like? Will these derived tables live in the same project and dataset as the source data?

With GCP ingestion so far, we're splitting tables into different datasets based on document namespace. We would need to change that practice to meet this requirement.

There are implications for permissions, testing, etc. that I haven't fully thought through yet.

Collaborator Author:

I don't know, and a dataset per document namespace seems good to me. This has lots of implications, but if we can avoid depending on a static dataset name, then we only need unique datasets per test, instead of unique projects, in order to run tests in parallel.

I think this is fine for queries that only read one input table (hence "should" not "must"), because the output dataset can be specified separately from the default dataset. For queries that need to read multiple tables from multiple datasets, I think for now we can just assume they're either run in series or require multiple projects. The first time we need that, we can consider solutions like templating dataset names for testing, and add a recommendation here to follow the chosen solution.
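
For illustration, a minimal sketch of that separation (the dataset name here is hypothetical): the unqualified input resolves against whatever default dataset the test harness sets on the job, while the output dataset is named explicitly.

```sql
-- Hypothetical test run: `clients_daily_v6` resolves against the job's
-- default dataset, while the destination dataset is spelled out.
CREATE OR REPLACE TABLE
  test_dataset_1234.clients_last_seen_v1 AS
SELECT
  *
FROM
  clients_daily_v6
```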

- Should use incremental queries
- Should filter input tables on partition and clustering columns
- Should use UDF language `SQL` over `js` for performance (see the sketch
  after this list)
- Should use UDFs for reusability
- Should use query parameters over jinja templating
- Temporary issue: Airflow 1.10+ is required in order to use query parameters
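
To illustrate the UDF guidance, a minimal sketch with a hypothetical helper (not a function from this repo) written both ways; the `SQL` form is evaluated natively, while the `js` form pays JavaScript engine overhead per call:

```sql
-- SQL UDF: evaluated natively by BigQuery.
CREATE TEMP FUNCTION day_diff(d1 DATE, d2 DATE) AS (
  DATE_DIFF(d1, d2, DAY)
);

-- Equivalent js UDF: incurs JavaScript evaluation overhead on every call.
CREATE TEMP FUNCTION day_diff_js(d1 STRING, d2 STRING)
RETURNS INT64
LANGUAGE js AS """
  return Math.round((new Date(d1) - new Date(d2)) / 86400000);
""";

SELECT
  day_diff(DATE '2019-02-15', DATE '2019-01-18');  -- 28
```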

Incremental Queries
===

Incremental queries have these benefits:

- BigQuery billing discounts for destination table partitions not modified in
the last 90 days
- Less Airflow configuration required
- Will have tooling to automate backfilling
- Will have tooling to replace partitions atomically to prevent duplicate data
- Will have tooling to generate an optimized "destination plus" view that
calculates the most recent partition

Incremental queries have these properties:

- Must accept a date via `@submission_date` query parameter
- Must output a column named `submission_date` matching the query parameter
- Must produce similar results when run multiple times
- Should produce identical results when run multiple times
- May depend on the previous partition
- If using the previous partition, must include a `.init.sql` query to
  initialize the first partition
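
Putting the required properties together, a minimal sketch of an incremental query (the input table and metric here are hypothetical):

```sql
-- Reads exactly one day of input and emits exactly one output partition;
-- rerunning for the same @submission_date reproduces the same rows,
-- except for generated_time.
SELECT
  @submission_date AS submission_date,
  CURRENT_DATETIME() AS generated_time,
  COUNT(*) AS pings  -- hypothetical metric
FROM
  some_daily_table_v1  -- hypothetical input table
WHERE
  submission_date_s3 = @submission_date  -- filter on the partition column
```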
17 changes: 17 additions & 0 deletions sql/clients_last_seen_v1.init.sql
@@ -0,0 +1,17 @@
SELECT
@submission_date AS submission_date,
CURRENT_DATETIME() AS generated_time,
MAX(submission_date_s3) AS last_seen_date,
-- approximate LAST_VALUE(input).*
ARRAY_AGG(input
ORDER BY submission_date_s3 DESC
LIMIT 1
)[OFFSET(0)].* EXCEPT (submission_date_s3)
Contributor:

This is fascinating. I like this better than having to use a ROW_NUMBER window function and then select n = 1.

Collaborator Author (@relud, Feb 15, 2019):

We could alternately use `ANY_VALUE(LAST_VALUE(input) OVER (PARTITION BY client_id ORDER BY submission_date_s3))`, but I don't know the performance implications of that.

Collaborator Author (@relud, Feb 27, 2019):

> but I don't know the performance implications of that

I decided to check and it's not as simple as above, but using a window is so much faster it hurts (runs in ~1/6th of the time and uses ~1/8th of the compute)
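
For the record, one way to write the window-based variant (this sketch uses `QUALIFY`, a BigQuery feature newer than this PR, so it is illustrative rather than the exact query that was benchmarked):

```sql
-- Take each client's most recent clients_daily row via window functions
-- instead of ARRAY_AGG ... ORDER BY ... LIMIT 1.
SELECT
  @submission_date AS submission_date,
  CURRENT_DATETIME() AS generated_time,
  MAX(submission_date_s3) OVER (PARTITION BY client_id) AS last_seen_date,
  input.* EXCEPT (submission_date_s3)
FROM
  clients_daily_v6 AS input
WHERE
  submission_date_s3 <= @submission_date
  AND submission_date_s3 > DATE_SUB(@submission_date, INTERVAL 28 DAY)
QUALIFY
  ROW_NUMBER() OVER (
    PARTITION BY client_id
    ORDER BY submission_date_s3 DESC
  ) = 1
```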

FROM
clients_daily_v6 AS input
WHERE
submission_date_s3 <= @submission_date
AND submission_date_s3 > DATE_SUB(@submission_date, INTERVAL 28 DAY)
GROUP BY
input.client_id
30 changes: 30 additions & 0 deletions sql/clients_last_seen_v1.sql
@@ -0,0 +1,30 @@
WITH current_sample AS (
SELECT
submission_date_s3 AS last_seen_date,
* EXCEPT (submission_date_s3)
FROM
clients_daily_v6
WHERE
submission_date_s3 = @submission_date
), previous AS (
SELECT
* EXCEPT (submission_date, generated_time)
FROM
analysis.clients_last_seen_v1
WHERE
submission_date = DATE_SUB(@submission_date, INTERVAL 1 DAY)
AND last_seen_date > DATE_SUB(@submission_date, INTERVAL 28 DAY)
)
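-- Prefer the current day's row for a client when one exists; otherwise
-- carry forward the previous day's row, which the previous CTE has already
-- restricted to the trailing 28-day window.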
SELECT
@submission_date AS submission_date,
CURRENT_DATETIME() AS generated_time,
IF(current_sample.client_id IS NOT NULL, current_sample, previous).*
FROM
current_sample
FULL JOIN
previous
USING (client_id)
24 changes: 24 additions & 0 deletions sql/firefox_desktop_exact_mau28_by_dimensions_v1.sql
@@ -0,0 +1,24 @@
SELECT
submission_date,
CURRENT_DATETIME() AS generated_time,
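-- each row of clients_last_seen_v1 represents a client active in the past
-- 28 days, so COUNT(*) per dimension group is exact MAU, and clients whose
-- last_seen_date is the current partition date are today's DAU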
COUNT(*) AS mau,
COUNTIF(last_seen_date = submission_date) AS dau,
-- requested fields from bug 1525689
source,
medium,
campaign,
content,
country,
distribution_id
FROM
clients_last_seen_v1
WHERE
submission_date = @submission_date
GROUP BY
submission_date,
source,
medium,
campaign,
content,
country,
distribution_id
@@ -2,9 +2,9 @@ SELECT
 @submission_date AS submission_date,
 CURRENT_DATETIME() AS generated_time,
 COUNT(DISTINCT client_id) AS mau,
-SUM(CAST(submission_date_s3 = @submission_date AS INT64)) as dau
+COUNTIF(submission_date_s3 = @submission_date) AS dau
 FROM
-telemetry.clients_daily_v6
+clients_daily_v6
 WHERE
 submission_date_s3 <= @submission_date
-AND submission_date_s3 > DATE_ADD(@submission_date, INTERVAL -28 DAY)
+AND submission_date_s3 > DATE_SUB(@submission_date, INTERVAL 28 DAY)