Skip to content

Conversation

jl982
Copy link
Owner

@jl982 jl982 commented Sep 18, 2020

There are 2 commits in this PR, the first commit backports the old prototype from Liang-Chi (apache#15178) to the master branch, while the second commit contains my modifications. Aside from logging and naming changes, what I mainly did was to 1) refactor TorrentBroadcast to a superclass and have TorrentDriverBroadcast and TorrentExecutorBroadcast as subclasses, and 2) change BroadcastHashJoinExec such that broadcasted value does not unintentionally get sent back to the driver.

Outstanding items I can think of right now:

  1. Implement size estimation for executor side broadcast (ebc) (currently it returns Long.MaxValue)
  2. Understand if broadcast size limitation for driver side broadcast (dbc) - less of 512 million rows or 8GB - needs to be enforced also for ebc
  3. Handle what happens to ebc when executors are added or lost, especially if all executors are dead
  4. Verify that canceling works for ebc
  5. Run more performance tests comparing dbc and ebc (eg. ebc seems to be slower when there are more executors)

@PavithraRamachandran
Copy link

hi @jl982 we are currently trying to work on executor broadcast. Will it be possible to discuss with you in more detail? could you share your email id ?

@jl982
Copy link
Owner Author

jl982 commented Feb 23, 2022

@PavithraRamachandran Great to hear that. I haven’t looked at this code in a while, but I’m happy to help out however I can in this thread

@iRakson
Copy link

iRakson commented Feb 25, 2022

@jl982 When we took this code and incorporated to our fork, we are able to get desired results with one executor. But in case of multiple executors, executor side broadcast is performing poorer than sort merge join itself.
You have also mentioned that with multiple executor there is slight degradation. Did you found any solution for that?

@jl982
Copy link
Owner Author

jl982 commented Feb 27, 2022

@iRakson Right, I do remember seeing performance degradation with more executors. But unfortunately, I never had time to investigate the cause. I recommend that you look into the logs, and understand how much time was taken by the broadcast/receive/hashmap construction phase, versus performing the actual join. I suspect that in the multi executor case, the current code has some additional overhead in the broadcast phase that needs to be addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants