Skip to content

Conversation

@chenghao-intel
Copy link
Contributor

UDF contract change in Hive 0.13.1. In Hive 0.12.0, it's always safe to construct and initialize a fresh UDF object on worker side, while in Hive 0.13.1, UDF objects should only be initialized on driver side and then serialized to the worker side. We provide the ability to serialize the initialized UDF instance and deserialize them cross process boundary.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24237 has started for PR 3640 at commit 19cbd46.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24238 has started for PR 3640 at commit e9c3212.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24237 has finished for PR 3640 at commit 19cbd46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24237/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24238 has finished for PR 3640 at commit e9c3212.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24238/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24240 has started for PR 3640 at commit 396c0e1.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24240 has finished for PR 3640 at commit 396c0e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HiveFunctionCache(var functionClassName: String) extends java.io.Externalizable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24240/
Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HiveFunctionWrapper might be a better name, since essentially this class is just used to make Hive function objects serializable. And traditionally Spark uses "wrapper" as suffix for classes with similar purposes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a good suggestion, will update it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment, related to HiveShim. I was thinking to move this class rather than de/serializePlan methods into the shim layer. Hive 0.12.0 is not affected by SPARK-4785, thus the version in 0.12.0 shim can be very simple. We only need to handle 0.13.1 there. This also lowers the possibility of breaking 0.12.0 code paths.

Also after moving this class into the shim layer, as I've mentioned in another comment, instead of relying on de/serializePlan, we can just mimic Utilities.de/serializeObjectByKryo in read/writeExternal methods in this class.

@liancheng
Copy link
Contributor

Appreciate a lot for fixing this case! The serialization wrapper class makes sense. However, would like to make some refactoring. A summary of my comments above:

  1. Renaming HiveFunctionCache to HiveFunctionWrapper

  2. Moving HiveFunctionWrapper to the shim layer, and keep Hive 0.12.0 code path exactly the same as before.

    Considering Hive 0.12.0 is not affected by this issue, and 1.2.0 release is really close, I'd like to lower the possibility of breaking 0.12.0 code paths as much as possible.

  3. Add a HiveSimpleUdfWrapper, which inherits from HiveFunctionWrapper.

    As you've mentioned in the code, HiveSimpleUdf is a special case, it shouldn't be serialized, and should always create a fresh object on executor side. Currently this special case complicates HiveFunctionWrapper implementation and makes it somewhat confusing. Defining a special subclass for HvieSimpleUdf helps making HiveFunctionWrapper simpler (e.g. no need to serialize the boolean flag any more).

  4. Avoid using Utilities.de/serializePlan by mimicking Utilities.de/serializeObjectByKryo in the read/writeExternal methods of the wrapper class, so that we can remove the expensive HiveConf instantiation.

In a word, we can add two classes, HiveFunctionWrapper and HiveSImpleUdfWrapper into the shim layer, and make sure that the 0.12.0 version behaves exactly the same as before.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24247 has started for PR 3640 at commit 74466a3.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24248 has started for PR 3640 at commit 8e13756.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24247 has finished for PR 3640 at commit 74466a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HiveFunctionWrapper(var functionClassName: String) extends java.io.Serializable
    • class HiveFunctionWrapper(var functionClassName: String) extends java.io.Externalizable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24247/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24248 has finished for PR 3640 at commit 8e13756.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HiveFunctionWrapper(var functionClassName: String) extends java.io.Serializable
    • class HiveFunctionWrapper(var functionClassName: String) extends java.io.Externalizable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24248/
Test PASSed.

@liancheng
Copy link
Contributor

The current design of this PR is derived from some background knowledges. I'd like to provide a brief summary here for future reference.

As mentioned in the PR description, different from Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with Kryo or XML serializer. Several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created in Hive. Spark Kryo serializer wasn't used because there's no available SparkConf instance.

asfgit pushed a commit that referenced this pull request Dec 9, 2014
…m with a wrapper

Different from Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with Kryo or XML serializer. Several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created in Hive. Spark Kryo serializer wasn't used because there's no available SparkConf instance.

Author: Cheng Hao <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #3640 from chenghao-intel/udf_serde and squashes the following commits:

8e13756 [Cheng Hao] Update the comment
74466a3 [Cheng Hao] refactor as feedbacks
396c0e1 [Cheng Hao] avoid Simple UDF to be serialized
e9c3212 [Cheng Hao] update the comment
19cbd46 [Cheng Hao] support udf instance ser/de after initialization

(cherry picked from commit 383c555)
Signed-off-by: Michael Armbrust <[email protected]>
@marmbrus
Copy link
Contributor

marmbrus commented Dec 9, 2014

Thanks a lot guys for digging into this! Merged to master and branch 1.2

@asfgit asfgit closed this in 383c555 Dec 9, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants