Support Parallel Processing of Window Functions. #1216

avamingli · 2025-07-07T08:31:41Z

avamingli
Jul 7, 2025
Collaborator

Description

Building upon PostgreSQL and Greenplum, Cloudberry has implemented numerous parallel processing capabilities that significantly enhance query performance. Based on real customer requirements, there is an urgent need to support parallel processing of window functions, as most production environments contain SQL statements utilizing window functions. Window functions are critically important and widely used in complex production scenarios.

In PostgreSQL, window functions cannot be parallelized and can only be executed in the leader process of a parallel plan. This limitation exists because PostgreSQL, as a single-machine database, lacks the concept of data distribution: when processing window functions, child nodes cannot guarantee that they deliver rows to the upper node grouped by the PARTITION BY keys.

In our distributed environment, the processes of a parallel scan compete with each other to read data, so which process reads which rows is effectively random. Therefore, even with distribution keys, data appears randomly distributed from the perspective of upper nodes.

Examining MPP distribution characteristics, when window functions contain PARTITION BY clauses, their semantics are very similar to GROUP BY. We can leverage the redistribution feature of Motion to enable parallel processing of window functions based on different PARTITION BY keys.
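
The idea above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions, not Cloudberry code: `redistribute` is a hypothetical stand-in for a redistribute Motion that hashes on the PARTITION BY key, `round_robin` stands in for rows arriving at workers in arbitrary order, and the sample rows are invented. It suggests why routing by the partition key lets each worker compute ROW_NUMBER() independently, while an arbitrary split does not.

```python
from collections import defaultdict
from itertools import groupby


def row_number(rows, partition_key, order_key):
    """Serial ROW_NUMBER() OVER (PARTITION BY partition_key ORDER BY order_key DESC)."""
    ordered = sorted(rows, key=lambda r: (r[partition_key], -r[order_key]))
    out = []
    for _, group in groupby(ordered, key=lambda r: r[partition_key]):
        for n, row in enumerate(group, start=1):
            out.append({**row, "rn": n})
    return out


def redistribute(rows, partition_key, n_workers):
    """Hypothetical stand-in for a redistribute Motion: route rows by hash of the key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[partition_key]) % n_workers].append(row)
    return [buckets[w] for w in range(n_workers)]


def round_robin(rows, n_workers):
    """Arbitrary split that ignores the partition key."""
    return [rows[w::n_workers] for w in range(n_workers)]


def run_workers(slices, partition_key, order_key):
    """Each worker evaluates the window function on its own slice only."""
    result = []
    for worker_rows in slices:
        result.extend(row_number(worker_rows, partition_key, order_key))
    return result


def canonical(rows):
    """Order-independent form for comparing result sets."""
    return sorted((r["order"], r["qty"], r["rn"]) for r in rows)


# Invented sample data: (order number, quantity).
rows = [{"order": o, "qty": q}
        for o, q in [(1, 5), (2, 7), (1, 9), (3, 4), (2, 1), (1, 2), (3, 8)]]

serial = canonical(row_number(rows, "order", "qty"))
by_key = canonical(run_workers(redistribute(rows, "order", 3), "order", "qty"))
naive = canonical(run_workers(round_robin(rows, 3), "order", "qty"))

print(by_key == serial)  # redistribution by key matches the serial result
print(naive == serial)   # an arbitrary split does not
```

Because every row of a given partition lands on the same worker after redistribution, no worker needs to see another worker's rows, which is the same property that makes parallel GROUP BY work.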

Furthermore, by implementing parallel processing of window functions, we can integrate their parallel paths with other operators, enabling end-to-end parallelization of entire SQL queries. Combined with our existing support for parallel Join, Union, and DISTINCT operations, this would further optimize complex queries like TPC-DS Q98.

WITH customer_data AS (
    SELECT DISTINCT
        c_customer_sk,
        c_first_name,
        c_last_name,
        cd_demo_sk
    FROM 
        customer,
        customer_demographics
    WHERE 
        c_current_cdemo_sk = cd_demo_sk
),
ranked_items AS (
    SELECT 
        cs_item_sk,
        cs_order_number,
        cs_bill_customer_sk,  -- needed by the join on ri.cs_bill_customer_sk below
        ROW_NUMBER() OVER (PARTITION BY cs_order_number ORDER BY cs_quantity DESC) AS rank
    FROM 
        catalog_sales
    WHERE 
        cs_sold_date_sk BETWEEN 2451545 AND 2451910
)
SELECT DISTINCT
    cd.c_first_name,
    cd.c_last_name,
    ri.cs_item_sk
FROM 
    customer_data cd
JOIN 
    ranked_items ri ON cd.c_customer_sk = ri.cs_bill_customer_sk
WHERE 
    ri.rank <= 3
UNION ALL
SELECT DISTINCT
    cd.c_first_name,
    cd.c_last_name,
    ws.ws_item_sk
FROM 
    customer_data cd
JOIN 
    web_sales ws ON cd.c_customer_sk = ws.ws_bill_customer_sk
WHERE 
    ws.ws_sold_date_sk BETWEEN 2451545 AND 2451910
ORDER BY 
    c_last_name, c_first_name;

Use case/motivation

No response

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

cnliuchong · 2025-07-07T09:49:21Z

cnliuchong
Jul 7, 2025

With large amounts of data, window functions perform poorly. If parallel execution were possible, performance would improve greatly.

0 replies

my-ship-it · 2025-07-08T06:12:28Z

my-ship-it
Jul 8, 2025
Collaborator

Leveraging the parallel approach of MPP to implement executor parallelism seems to be a viable path. In fact, because Cloudberry data carries distribution information, can we implement executor parallelism more easily than Postgres?

1 reply

avamingli Jul 8, 2025
Collaborator Author

Yes, and the focus is more on optimizer changes, specifically leveraging Motion to redistribute data based on the Partition By clauses of window functions.
Once the plan is established, each process executes its plan independently, so there won't be significant changes to the executor itself.

avamingli · 2025-07-16T06:58:54Z

avamingli
Jul 16, 2025
Collaborator Author

During development, I found that the result of a WindowAgg without an ORDER BY clause is unstable.

Referring to the SQL 2011 standard, it states that if ORDER BY is omitted, the order of rows in the partition is undefined.
While using a window function without ORDER BY is valid, the resulting output seems unpredictable.

SELECT sum(unique1) OVER (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
 unique1, four
FROM tenk1
WHERE unique1 < 10;

This case comes from window.sql in the regression tests.

explain(costs off) SELECT sum(unique1) over (rows between current row and unbounded following),
 unique1, four
FROM tenk1 WHERE unique1 < 10;
                             QUERY PLAN
--------------------------------------------------------------------
 WindowAgg
   Window: w1 AS (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
   ->  Index Scan using tenk1_unique1 on tenk1
         Index Cond: (unique1 < 10)
(4 rows)

regression=# SELECT sum(unique1) over (rows between current row and unbounded following),
 unique1, four
FROM tenk1 WHERE unique1 < 10;
 sum | unique1 | four
-----+---------+------
  45 |       0 |    0
  45 |       1 |    1
  44 |       2 |    2
  42 |       3 |    3
  39 |       4 |    0
  35 |       5 |    1
  30 |       6 |    2
  24 |       7 |    3
  17 |       8 |    0
   9 |       9 |    1
(10 rows)

However, after setting enable_indexscan = off, the results changed:

regression=# set enable_indexscan = off;
SET
regression=# SELECT sum(unique1) over (rows between current row and unbounded following),
 unique1, four
FROM tenk1 WHERE unique1 < 10;
 sum | unique1 | four
-----+---------+------
  45 |       4 |    0
  41 |       2 |    2
  39 |       1 |    1
  38 |       6 |    2
  32 |       9 |    1
  23 |       8 |    0
  15 |       5 |    1
  10 |       3 |    3
   7 |       7 |    3
   0 |       0 |    0
(10 rows)

regression=# explain(costs off) SELECT sum(unique1) over (rows between current row and unbounded following),
 unique1, four
FROM tenk1 WHERE unique1 < 10;
                             QUERY PLAN
--------------------------------------------------------------------
 WindowAgg
   Window: w1 AS (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
   ->  Seq Scan on tenk1
         Filter: (unique1 < 10)
(4 rows)

Parallel processing of window functions makes this worse, because the order in which rows reach the WindowAgg varies even more.
However, both results are correct.
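
A minimal Python sketch of why both outputs are valid: the frame ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING is evaluated over whatever order the rows arrive in. The function name `sum_current_to_end` is made up for illustration. The two input orders are taken from the `unique1` column of the two result sets above.

```python
def sum_current_to_end(values):
    """sum(x) OVER (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
    evaluated over the rows in the order they arrive."""
    sums = []
    running = 0
    # Walk from the last row backwards, accumulating a suffix sum.
    for v in reversed(values):
        running += v
        sums.append(running)
    sums.reverse()
    return sums


# Row order produced by the index scan plan (unique1 column of the first result).
index_scan_order = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Row order produced by the seq scan plan (unique1 column of the second result).
seq_scan_order = [4, 2, 1, 6, 9, 8, 5, 3, 7, 0]

print(sum_current_to_end(index_scan_order))
# [45, 45, 44, 42, 39, 35, 30, 24, 17, 9]
print(sum_current_to_end(seq_scan_order))
# [45, 41, 39, 38, 32, 23, 15, 10, 7, 0]
```

The same function applied to two arrival orders reproduces both `sum` columns, so the difference comes entirely from the unspecified row order, not from the aggregate.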

To pass parallel test cases, we need to modify the SQL in that case.

4 replies

avamingli Jul 16, 2025
Collaborator Author

To pass parallel test cases, we need to modify the SQL in that case.

Postgres discussion: https://www.postgresql.org/message-id/flat/fbb5c0d7-4a96-4dd1-9a26-5dfccfac667a%40Spark

AFAIK, disabling parallel execution for window aggregates that have no ORDER BY clause (and no PARTITION BY clause either) is the preferred approach.
While the unstable results are technically valid per SQL standards, proactively avoiding this behavior prevents customer confusion between actual bugs and intended SQL standard behavior.

@my-ship-it, @jianlirong, would you mind sharing your perspectives on this?

jianlirong Jul 16, 2025

In my personal opinion, we should modify the corresponding SQL statement and add ORDER BY. Although we're discussing window functions here, the issue reflected in this example is essentially no different from the following statement:

SELECT unique1 FROM tenk1 WHERE unique1 < 10;

In an MPP-style database, the results of the above query will not be the same if we run it multiple times. That depends entirely on the order in which tuples arrive at the master node. The only way to make the result stable is to add ORDER BY. That's why I think we should do the same for the window function query under discussion.
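
To illustrate the suggestion, here is a small Python sketch, assuming the test query is changed to use `OVER (ORDER BY unique1 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)`. The helper name `sum_current_to_end_ordered` is hypothetical, and the shuffles merely stand in for tuples arriving at the master node in arbitrary order.

```python
import random


def sum_current_to_end_ordered(values):
    """sum(x) OVER (ORDER BY x ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING).
    Returns a mapping from each x to its window sum; x is assumed unique."""
    ordered = sorted(values)  # the window's ORDER BY fixes the frame order
    result = {}
    running = 0
    for v in reversed(ordered):
        running += v
        result[v] = running
    return result


rng = random.Random(0)  # fixed seed so the sketch is repeatable
baseline = sum_current_to_end_ordered(range(10))

stable = True
for _ in range(100):
    arrival = list(range(10))
    rng.shuffle(arrival)  # simulate an arbitrary arrival order
    stable = stable and sum_current_to_end_ordered(arrival) == baseline

print(stable)  # True: each row's sum no longer depends on arrival order
```

Note that the window's ORDER BY only stabilizes the value computed for each row. The order in which the final rows are returned still needs an ORDER BY on the outer query, as with the plain SELECT above.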

leborchuk Jul 16, 2025
Collaborator

During development, I found that the result of a WindowAgg without an ORDER BY clause is unstable.

Referring to the SQL 2011 standard, it states that if ORDER BY is omitted, the order of rows in the partition is undefined. While using a window function without ORDER BY is valid, the resulting output seems unpredictable.

To pass parallel test cases, we need to modify the SQL in that case.

I have faced this issue many times, not only in CBDB but also in Postgres and Oracle.

People write SQL queries with window functions without ORDER BY, then filter the output (the most common case being getting the maximum value) and report that the database has given the wrong result. But it is the SQL that is wrong, not the answer.

I believe we need to modify our test cases.

avamingli Jul 17, 2025
Collaborator Author

@leborchuk
Thank you so much for sharing your insightful observations and experiences.
I will make the necessary adjustments to ensure our tests are robust.

A preliminary test indicates that there are not many failures.

=========================
 13 of 700 tests failed.
=========================
