-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-4785] [SQL] Support udf instance ser/de after initialization #3640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #24237 has started for PR 3640 at commit
|
|
Test build #24238 has started for PR 3640 at commit
|
|
Test build #24237 has finished for PR 3640 at commit
|
|
Test FAILed. |
|
Test build #24238 has finished for PR 3640 at commit
|
|
Test FAILed. |
|
Test build #24240 has started for PR 3640 at commit
|
|
Test build #24240 has finished for PR 3640 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HiveFunctionWrapper might be a better name, since essentially this class is just used to make Hive function objects serializable. And traditionally Spark uses "wrapper" as suffix for classes with similar purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good suggestion, will update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another comment, related to HiveShim. I was thinking to move this class rather than de/serializePlan methods into the shim layer. Hive 0.12.0 is not affected by SPARK-4785, thus the version in 0.12.0 shim can be very simple. We only need to handle 0.13.1 there. This also lowers the possibility of breaking 0.12.0 code paths.
Also after moving this class into the shim layer, as I've mentioned in another comment, instead of relying on de/serializePlan, we can just mimic Utilities.de/serializeObjectByKryo in read/writeExternal methods in this class.
|
Appreciate a lot for fixing this case! The serialization wrapper class makes sense. However, would like to make some refactoring. A summary of my comments above:
In a word, we can add two classes, |
|
Test build #24247 has started for PR 3640 at commit
|
|
Test build #24248 has started for PR 3640 at commit
|
|
Test build #24247 has finished for PR 3640 at commit
|
|
Test PASSed. |
|
Test build #24248 has finished for PR 3640 at commit
|
|
Test PASSed. |
|
The current design of this PR is derived from some background knowledges. I'd like to provide a brief summary here for future reference. As mentioned in the PR description, different from Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. |
…m with a wrapper Different from Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with Kryo or XML serializer. Several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created in Hive. Spark Kryo serializer wasn't used because there's no available SparkConf instance. Author: Cheng Hao <[email protected]> Author: Cheng Lian <[email protected]> Closes #3640 from chenghao-intel/udf_serde and squashes the following commits: 8e13756 [Cheng Hao] Update the comment 74466a3 [Cheng Hao] refactor as feedbacks 396c0e1 [Cheng Hao] avoid Simple UDF to be serialized e9c3212 [Cheng Hao] update the comment 19cbd46 [Cheng Hao] support udf instance ser/de after initialization (cherry picked from commit 383c555) Signed-off-by: Michael Armbrust <[email protected]>
|
Thanks a lot guys for digging into this! Merged to master and branch 1.2 |
UDF contract change in Hive 0.13.1. In Hive 0.12.0, it's always safe to construct and initialize a fresh UDF object on worker side, while in Hive 0.13.1, UDF objects should only be initialized on driver side and then serialized to the worker side. We provide the ability to serialize the initialized UDF instance and deserialize them cross process boundary.