[WIP][SPARK-1875]: NoClassDefFoundError: StringUtils when building against Hadoop 1 #820
Conversation
Can one of the admins verify this patch?

@pwendell

Oh, I'm sorry.

Jenkins, ok to test.

Merged build triggered.

Merged build started.

Merged build finished. All automated tests passed.

All automated tests passed.

Thanks @witgo for the quick turnaround -- will take a look at it. One thing that concerns me slightly now is that we always need to specify a Maven profile. Is there a problem if we add commons-lang in the Hadoop 2 profiles too? Do those break?

Basically one specific problem here is that when we publish to Maven Central and people link to our POM, they will get a dependency on Spark without any profiles activated. So if they test their app, they might run into this error. If that's not the case, I'd at least like to understand why. Our build should work by default without any profiles for this reason.

@mateiz
So if a user adds a dependency in their project on spark-core 1.0.0, spark-hive 1.0.0 and a hadoop-client that Hive supports, will things work? I'll try this too but it's good to test locally. You need to do mvn -DskipTests install to publish the packages into your local Maven cache, then build a standalone Spark app against it. We need to test both a dependency on spark-core by itself and one on spark-core with spark-hive.
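For illustration, a minimal local smoke test along the lines described above could look like the sketch below. The sbt coordinates, the `SparkSmokeTest` name, and the trivial job are assumptions made for the example, not part of this PR.
```scala
// Hypothetical standalone app used only to check that spark-core (and, in a
// second run, spark-hive) resolved from the local Maven cache pulls in its
// transitive dependencies (e.g. commons-lang's StringUtils) at runtime.
//
// Assumed build.sbt settings (illustrative versions):
//   resolvers += Resolver.mavenLocal
//   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
//   // and, for the second test: "org.apache.spark" %% "spark-hive" % "1.0.0"
import org.apache.spark.{SparkConf, SparkContext}

object SparkSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("smoke-test").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Run a trivial job; a missing transitive dependency would surface as a
    // NoClassDefFoundError at this point.
    println(sc.parallelize(1 to 100).count())
    sc.stop()
  }
}
```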
I disagree with this change. At least please see comment at https://issues.apache.org/jira/browse/SPARK-1875 first.

The fix suggested there does seem better, namely don't exclude commons-lang.

@mateiz How about this solution?
…nProtoConverter
### What changes were proposed in this pull request?
This PR refactors the `LiteralExpressionProtoConverter` to use `CatalystTypeConverters` for consistent type conversion, eliminating code duplication and improving maintainability.
**Key changes:**
1. **Simplified `LiteralExpressionProtoConverter.toCatalystExpression()`**: Replaced the large switch statement (86 lines) with a clean 3-line implementation that leverages existing conversion utilities (a rough sketch follows this list)
2. **Added TIME type support**: Added missing TIME literal type conversion in `LiteralValueProtoConverter.toScalaValue()`
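As a rough illustration of the direction in item 1 (not the PR's actual code), the conversion can lean on `CatalystTypeConverters` once the proto literal has been decoded into a Scala value and its `DataType`; the `toCatalystLiteral` helper below is hypothetical.
```scala
// Minimal sketch, assuming the proto literal has already been decoded into an
// external Scala value (e.g. via LiteralValueProtoConverter.toScalaValue) and
// its Catalyst DataType.
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.DataType

def toCatalystLiteral(scalaValue: Any, dataType: DataType): Literal = {
  // createToCatalystConverter is type-driven, so nested structures such as
  // arrays of case classes, maps, and Decimal values are converted to
  // Catalyst's internal representation uniformly -- this is what replaces
  // the old per-type switch.
  val toCatalyst = CatalystTypeConverters.createToCatalystConverter(dataType)
  Literal(toCatalyst(scalaValue), dataType)
}
```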
### Why are the changes needed?
1. **Type conversion issues**: Some complex nested data structures (e.g., arrays of case classes) failed to convert to Catalyst's internal representation when using `expressions.Literal.create(...)`.
2. **Inconsistent behaviors**: Differences in behavior between Spark Connect and classic Spark for the same data types (e.g., Decimal).
### Does this PR introduce _any_ user-facing change?
**Yes** - Users can now successfully use `typedlit` with an array of case classes. Previously, attempting to use `typedlit(Array(CaseClass(1, "a")))` would fail (see the code snippet below for details), but now it works correctly and converts case classes to proper struct representations.
```scala
import org.apache.spark.sql.functions.typedlit
case class CaseClass(a: Int, b: String)
spark.sql("select 1").select(typedlit(Array(CaseClass(1, "a")))).collect()
// Below is the error message:
"""
org.apache.spark.SparkIllegalArgumentException: requirement failed: Literal must have a corresponding value to array<struct<a:int,b:string>>, but class GenericArrayData found.
scala.Predef$.require(Predef.scala:337)
org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:306)
org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:456)
org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:206)
org.apache.spark.sql.connect.planner.LiteralExpressionProtoConverter$.toCatalystExpression(LiteralExpressionProtoConverter.scala:103)
"""
```
In addition, some Catalyst values (e.g., Decimal 89.97620 -> 89.976200000000000000) have changed. Please see the changes in `explain-results/` for details.
```scala
import org.apache.spark.sql.functions.typedlit
spark.sql("select 1").select(typedlit(BigDecimal(8997620, 5)),typedlit(Array(BigDecimal(8997620, 5), BigDecimal(8997621, 5)))).explain()
// Current explain() output:
"""
Project [89.97620 AS 89.97620#819, [89.97620,89.97621] AS ARRAY(89.97620BD, 89.97621BD)#820]
"""
// Expected explain() output (i.e., same as the classic mode):
"""
Project [89.976200000000000000 AS 89.976200000000000000#132, [89.976200000000000000,89.976210000000000000] AS ARRAY(89.976200000000000000BD, 89.976210000000000000BD)#133]
"""
```
### How was this patch tested?
`build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"`
`build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite"`
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.4.5
Closes #52188 from heyihong/SPARK-53438.
Authored-by: Yihong He <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>