Skip to content

Conversation

@jerrypeng
Copy link
Contributor

@jerrypeng jerrypeng commented Oct 19, 2022

What changes were proposed in this pull request?

Purging old entries in both the offset log and commit log will be done asynchronously.

For every micro-batch, older entries in both offset log and commit log are deleted. This is done so that the offset log and commit log do not continually grow. Please reference logic here

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539

The time spent performing these log purges is grouped with the “walCommit” execution time in the StreamingProgressListener metrics. Around two thirds of the “walCommit” execution time is performing these purge operations thus making these operations asynchronous will also reduce latency. Also, we do not necessarily need to perform the purges every micro-batch. When these purges are executed asynchronously, they do not need to block micro-batch execution and we don’t need to start another purge until the current one is finished. The purges can happen essentially in the background. We will just have to synchronize the purges with the offset WAL commits and completion commits so that we don’t have concurrent modifications of the offset log and commit log.

Why are the changes needed?

Decrease microbatch processing latency

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

@HeartSaVioR
Copy link
Contributor

cc. @zsxwing @xuanyuanking @viirya Appreciate your reviews. Thanks!

@HeartSaVioR
Copy link
Contributor

I'll find a time to review in tomorrow.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
Let me wait for a couple more days to seek more eyes of review. I'll merge this in early next week if there is no outstanding comment.

@jerrypeng
Copy link
Contributor Author

thanks @HeartSaVioR !

@zhengruifeng
Copy link
Contributor

the failed python code gen check is unrelated to this PR, please rebase to make CI green

@HeartSaVioR
Copy link
Contributor

https://github.com/jerrypeng/spark/actions/runs/3293047872/jobs/5447874146

Remaining steps are unrelated to this PR - only license check which is respected in this PR.

@HeartSaVioR
Copy link
Contributor

Thanks! Merging to master.

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?

Purging old entries in both the offset log and commit log will be done asynchronously.

For every micro-batch, older entries in both offset log and commit log are deleted. This is done so that the offset log and commit log do not continually grow.  Please reference logic here

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L539

The time spent performing these log purges is grouped with the “walCommit” execution time in the StreamingProgressListener metrics.  Around two thirds of the “walCommit” execution time is performing these purge operations thus making these operations asynchronous will also reduce latency.  Also, we do not necessarily need to perform the purges every micro-batch.  When these purges are executed asynchronously, they do not need to block micro-batch execution and we don’t need to start another purge until the current one is finished.  The purges can happen essentially in the background.  We will just have to synchronize the purges with the offset WAL commits and completion commits so that we don’t have concurrent modifications of the offset log and commit log.

### Why are the changes needed?

Decrease microbatch processing latency

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes apache#38313 from jerrypeng/SPARK-40849.

Authored-by: Jerry Peng <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants