[pull] master from apache:master #53
Merged
### What changes were proposed in this pull request?
This PR proposes to mark non-public API as package private, e.g. `private[connect]`.
### Why are the changes needed?
This is to control our API surface and not expose non-public API.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes #38392 from amaliujia/SPARK-40914.
Authored-by: Rui Wang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
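For illustration, a minimal sketch of the kind of change described above; the class and method names below are hypothetical, not necessarily ones touched by the PR:
```scala
package org.apache.spark.sql.connect.example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Helpers that are not part of the public Connect API are narrowed to
// private[connect]: they stay callable inside org.apache.spark.sql.connect
// but disappear from the public API surface.
private[connect] class ExamplePlanner(session: SparkSession) {
  private[connect] def transform(): LogicalPlan = ???
}
```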
… to avoid wrong check result when `expectedMessageParameters.size <= 4`
### What changes were proposed in this pull request?
This PR refactors the `AnalysisTest#assertAnalysisErrorClass` method to use `e.messageParameters != expectedMessageParameters` instead of `!e.messageParameters.sameElements(expectedMessageParameters)`, which avoids wrong check results when `expectedMessageParameters.size <= 4` (Scala's small immutable `Map`s iterate in insertion order, so the order-sensitive `sameElements` can report a mismatch for maps that are equal by content).
### Why are the changes needed?
Avoid wrong check result of `AnalysisTest#assertAnalysisErrorClass` when `expectedMessageParameters.size <= 4`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GitHub Actions
- Manual test:
```scala
Welcome to Scala 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_352).
Type in expressions for evaluation. Or try :help.
scala> :paste
// Entering paste mode (ctrl-D to finish)
val messageParameters = Map(
  "exprName" -> "`window_duration`",
  "valueRange" -> s"(0, 9223372036854775807]",
  "currentValue" -> "-1000000L",
  "sqlExpr" -> "\"window(2016-01-01 01:01:01, -1000000, 1000000, 0)\""
)
val expectedMessageParameters = Map(
  "sqlExpr" -> "\"window(2016-01-01 01:01:01, -1000000, 1000000, 0)\"",
  "exprName" -> "`window_duration`",
  "valueRange" -> s"(0, 9223372036854775807]",
  "currentValue" -> "-1000000L"
)
val tret = !messageParameters.sameElements(expectedMessageParameters)
val fret = messageParameters != expectedMessageParameters
// Exiting paste mode, now interpreting.
messageParameters: scala.collection.immutable.Map[String,String] = Map(exprName -> `window_duration`, valueRange -> (0, 9223372036854775807], currentValue -> -1000000L, sqlExpr -> "window(2016-01-01 01:01:01, -1000000, 1000000, 0)")
expectedMessageParameters: scala.collection.immutable.Map[String,String] = Map(sqlExpr -> "window(2016-01-01 01:01:01, -1000000, 1000000, 0)", exprName -> `window_duration`, valueRange -> (0, 9223372036854775807], currentValue -> -1000000L)
tret: Boolean = true
fret: Boolean = false
```
Closes #38396 from LuciferYang/SPARK-40919.
Authored-by: yangjie01 <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…se in Scala side
### What changes were proposed in this pull request?
This PR adds `assume` to the Python test added in #33559.
### Why are the changes needed?
In some testing environments, Python does not exist. This is consistent with other tests in this file. Otherwise, it fails as below:
```
java.lang.RuntimeException: Python availability: [true], pyspark availability: [false]
  at org.apache.spark.sql.IntegratedUDFTestUtils$.pythonFunc$lzycompute(IntegratedUDFTestUtils.scala:192)
  at org.apache.spark.sql.IntegratedUDFTestUtils$.org$apache$spark$sql$IntegratedUDFTestUtils$$pythonFunc(IntegratedUDFTestUtils.scala:172)
  at org.apache.spark.sql.IntegratedUDFTestUtils$TestPythonUDF$$anon$1.<init>(IntegratedUDFTestUtils.scala:337)
  at org.apache.spark.sql.IntegratedUDFTestUtils$TestPythonUDF.udf$lzycompute(IntegratedUDFTestUtils.scala:334)
  at org.apache.spark.sql.IntegratedUDFTestUtils$TestPythonUDF.udf(IntegratedUDFTestUtils.scala:334)
  at org.apache.spark.sql.IntegratedUDFTestUtils$TestPythonUDF.apply(IntegratedUDFTestUtils.scala:359)
  at org.apache.spark.sql.execution.python.PythonUDFSuite.$anonfun$new$11(PythonUDFSuite.scala:105)
  ...
```
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
Manually checked.
Closes #38407 from HyukjinKwon/SPARK-34265.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
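A minimal sketch of the guard pattern described above, assuming ScalaTest's `assume` and the availability flag exposed by `IntegratedUDFTestUtils`; the suite and test names are illustrative:
```scala
import org.apache.spark.sql.IntegratedUDFTestUtils.shouldTestPythonUDFs
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

class ExamplePythonUDFSuite extends QueryTest with SharedSparkSession {
  test("Python UDF test runs only when Python is available") {
    // `assume` turns a hard RuntimeException into a skipped test when the
    // environment has no Python/pyspark installation.
    assume(shouldTestPythonUDFs)
    // real assertions against the Python UDF would go here
  }
}
```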
### What changes were proposed in this pull request?
Upgrade actions/setup-python to v4.
### Why are the changes needed?
- https://github.com/actions/setup-python/releases/tag/v3.0.0: upgrade to node 16
- https://github.com/actions/setup-python/releases/tag/v4.3.0: cleanup set-output warning
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
CI passed
Closes #38408 from Yikun/setup-python.
Authored-by: Yikun Jiang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…ror classes
### What changes were proposed in this pull request?
This PR replaces `TypeCheckFailure` by `DataTypeMismatch` in type checks in the time window expressions.
### Why are the changes needed?
Migration onto error classes unifies Spark SQL error messages.
### Does this PR introduce _any_ user-facing change?
Yes. The PR changes user-facing error messages.
### How was this patch tested?
- Pass GitHub Actions
- Manual test:
```
build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.analysis.AnalysisErrorSuite" -Pscala-2.13
```
passed
Closes #38394 from LuciferYang/SPARK-40759.
Authored-by: yangjie01 <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
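A hedged sketch of the migration pattern described above; the error sub-class name is an assumption, and the message parameters mirror the time-window example shown earlier in this log:
```scala
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{DataTypeMismatch, TypeCheckFailure}

// Before: a plain, unstructured failure message.
val before = TypeCheckFailure("The window duration must be greater than 0.")

// After: a structured result carrying an error sub-class and named parameters,
// which the unified error framework can render consistently.
val after = DataTypeMismatch(
  errorSubClass = "VALUE_OUT_OF_RANGE", // assumed sub-class name
  messageParameters = Map(
    "exprName" -> "`window_duration`",
    "valueRange" -> "(0, 9223372036854775807]",
    "currentValue" -> "-1000000L"))
```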
…s onto error classes
### What changes were proposed in this pull request?
This PR aims to replace `TypeCheckFailure` by `DataTypeMismatch` in type checks in the higher-order function expressions, including:
1. ArraySort (2): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L403-L407
2. ArrayAggregate (1): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L807
3. MapZipWith (1): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L1028
### Why are the changes needed?
Migration onto error classes unifies Spark SQL error messages.
### Does this PR introduce _any_ user-facing change?
Yes. The PR changes user-facing error messages.
### How was this patch tested?
- Update existing UT
- Pass GA.
Closes #38359 from panbingkun/SPARK-40751.
Authored-by: panbingkun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…as.read_csv`
### What changes were proposed in this pull request?
As discussed in https://issues.apache.org/jira/browse/SPARK-40922:
> The path argument of `pyspark.pandas.read_csv(path, ...)` currently has type annotation `str` and is documented as
>
> path : str
>     The path string storing the CSV file to be read.
>
> The implementation however uses `pyspark.sql.DataFrameReader.csv(path, ...)`, which does support multiple paths:
>
> path : str or list
>     string, or list of strings, for input path(s), or RDD of Strings storing CSV rows.

This PR updates the type annotation and documentation of the `path` argument of `pyspark.pandas.read_csv`.
### Why are the changes needed?
Loading multiple CSV files at once is a useful feature to have and should be documented.
### Does this PR introduce _any_ user-facing change?
It documents an existing feature.
### How was this patch tested?
No need for tests (so far): only type annotations and docblocks were changed.
Closes #38399 from soxofaan/SPARK-40922-pyspark-pandas-read-csv-multiple-paths.
Authored-by: Stefaan Lippens <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Fix a bug in the `Unhex` function when there is an odd number of symbols in the input string.
### Why are the changes needed?
The `Unhex` function and other functions depending on it (e.g. `ToBinary`) produce incorrect output.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes #38402 from vitaliili-db/unhex.
Authored-by: Vitalii Li <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
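A minimal reproduction sketch for the odd-length case described above, assuming an active `SparkSession` named `spark`; the commit does not spell out the exact pre-fix values, so no expected output is asserted here:
```scala
// 'F1F' has an odd number of hex digits; both expressions go through Unhex,
// which this fix corrects for odd-length input.
spark.sql("SELECT unhex('F1F') AS u, to_binary('F1F', 'hex') AS b").show()
```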
…supportedOperationException
### What changes were proposed in this pull request?
This PR aims to replace UnsupportedOperationException with SparkUnsupportedOperationException.
### Why are the changes needed?
1. When I worked on https://issues.apache.org/jira/browse/SPARK-40889, I found that `QueryExecutionErrors.unsupportedPartitionTransformError` throws **UnsupportedOperationException** (not **SparkUnsupportedOperationException**), which does not seem to fit into the new error framework.
https://github.com/apache/spark/blob/a27b459be3ca2ad2d50b9d793b939071ca2270e2/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Implicits.scala#L71-L72
2. `QueryExecutionErrors.unsupportedPartitionTransformError` throws SparkUnsupportedOperationException, but the UT catches `UnsupportedOperationException`.
https://github.com/apache/spark/blob/a27b459be3ca2ad2d50b9d793b939071ca2270e2/sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala#L288-L301
https://github.com/apache/spark/blob/a27b459be3ca2ad2d50b9d793b939071ca2270e2/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L904-L909
https://github.com/apache/spark/blob/a27b459be3ca2ad2d50b9d793b939071ca2270e2/core/src/main/scala/org/apache/spark/SparkException.scala#L144-L154
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes #38387 from panbingkun/replace_UnsupportedOperationException.
Authored-by: panbingkun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade zstd-jni to 1.5.2-5.
### Why are the changes needed?
This version starts to support magic-less data frames:
- luben/zstd-jni#151
- luben/zstd-jni#235
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GitHub Actions
Closes #38412 from LuciferYang/zstd-1.5.2-5.
Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade Apache Arrow to 10.0.0.
### Why are the changes needed?
This version brings some bug fixes and improvements; see the official release notes:
- https://arrow.apache.org/release/10.0.0.html
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GitHub Actions
Closes #38369 from LuciferYang/SPARK-40895.
Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…e `Dockerfile.java17`
### What changes were proposed in this pull request?
This PR aims to use `Java 17` in the K8s Dockerfile by default and remove `Dockerfile.java17`.
### Why are the changes needed?
To update for Apache Spark 3.4.0.
```
$ docker run -it --rm kubespark/spark:dev cat /etc/os-release | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.1 LTS"
```
```
$ docker run -it --rm kubespark/spark:dev java -version | grep 64
OpenJDK 64-Bit Server VM Temurin-17.0.4.1+1 (build 17.0.4.1+1, mixed mode, sharing)
```
### Does this PR introduce _any_ user-facing change?
Yes, but only the published Docker images will get the latest OS (Ubuntu 22.04.1 LTS) and Java (17 LTS).
### How was this patch tested?
Pass the CIs.
Closes #38417 from dongjoon-hyun/SPARK-40941.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…llow multiple window_time calls
### What changes were proposed in this pull request?
This PR proposes to loosen the requirement of the window_time rule to allow multiple distinct window_time calls. After this change, users can call the window_time function with different windows in the same logical node (select, where, etc.).
Given that we allow multiple calls of window_time in a projection, we can no longer use the reserved column name "window_time". This PR picks up the SQL representation of the WindowTime expression to distinguish each distinct function call. (This is different from time window/session window, but arguably those are incorrect; we just can't fix them now since the change would incur backward incompatibility.)
### Why are the changes needed?
The rule for window_time followed the existing rules for time window/session window, which only allow a single function call in the same projection (strictly speaking, multiple calls are considered as one if the function is called with the same parameters).
For the time window/session window rules, the restriction makes sense since allowing multiple calls would produce a Cartesian product of rows (although Spark can handle it). But given that window_time only produces one value, the restriction no longer makes sense.
### Does this PR introduce _any_ user-facing change?
Yes, since it changes the resulting column name from a window_time function call, but the function is not released yet.
### How was this patch tested?
New test case.
Closes #38361 from HeartSaVioR/SPARK-40892.
Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
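A hedged usage sketch of the function described above, assuming a DataFrame `events` with an `eventTime` timestamp column; the names are illustrative and not taken from the PR's tests:
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window, window_time}

def selectWindowTime(events: DataFrame): DataFrame = {
  val counted = events
    .groupBy(window(col("eventTime"), "10 minutes"))
    .count()
  // After SPARK-40892 the output column name is derived from the SQL
  // representation of the window_time call rather than the reserved name
  // "window_time", so several distinct calls can coexist in one projection.
  counted.select(window_time(col("window")).as("wt"), col("count"))
}
```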
### What changes were proposed in this pull request?
This PR proposes to upgrade pandas version to 1.5.0 since the new pandas version is released. Please refer to [What's new in 1.5.0](https://pandas.pydata.org/docs/whatsnew/v1.5.0.html) for more detail.
### Why are the changes needed?
Pandas API on Spark should follow the latest pandas.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The existing tests should pass.
Closes #37955 from itholic/SPARK-40512.
Authored-by: itholic <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…atedScalarSubquery
### What changes were proposed in this pull request?
This PR updates `splitSubquery` in `RewriteCorrelatedScalarSubquery` to support non-aggregated one-row subqueries. In CheckAnalysis, we allow three types of correlated scalar subquery patterns:
1. SubqueryAlias/Project + Aggregate
2. SubqueryAlias/Project + Filter + Aggregate
3. SubqueryAlias/Project + LogicalPlan (maxRows <= 1)
https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L851-L856
We should support the third case in `splitSubquery` to avoid `Unexpected operator` exceptions.
### Why are the changes needed?
To fix an issue with the correlated subquery rewrite.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New unit tests.
Closes #38336 from allisonwang-db/spark-40862-split-subquery.
Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
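A hedged example of the third pattern above (a non-aggregated, at-most-one-row correlated scalar subquery), assuming an active `SparkSession` named `spark` and tables `t1` and `t2`; it mirrors the plan shape described by the PR rather than an actual test from it:
```scala
spark.sql("""
  SELECT t1.a,
         (SELECT t2.b FROM t2 WHERE t2.a = t1.a LIMIT 1) AS b -- no Aggregate, maxRows <= 1
  FROM t1
""")
```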
### What changes were proposed in this pull request?
1. Remove unused error classes: INCONSISTENT_BEHAVIOR_CROSS_VERSION.FORMAT_DATETIME_BY_NEW_PARSER, NAMESPACE_ALREADY_EXISTS, NAMESPACE_NOT_EMPTY, NAMESPACE_NOT_FOUND.
2. Rename the error class WRONG_NUM_PARAMS to WRONG_NUM_ARGS.
3. Use the correct error class INDEX_ALREADY_EXISTS in the exception `IndexAlreadyExistsException` instead of INDEX_NOT_FOUND.
4. Quote regexp patterns by ''.
5. Fix indentation in [QueryCompilationErrors.scala](https://github.com/apache/spark/pull/38398/files#diff-744ac13f6fe074fddeab09b407404bffa2386f54abc83c501e6e1fe618f6db56).
### Why are the changes needed?
To address tech debt.
### Does this PR introduce _any_ user-facing change?
Yes, it modifies user-facing error messages.
### How was this patch tested?
By running the modified test suites:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "test:testOnly *SQLQuerySuite"
$ build/sbt "test:testOnly *StringExpressionsSuite"
$ build/sbt "test:testOnly *.RegexpExpressionsSuite"
```
Closes #38398 from MaxGekk/remove-unused-error-classes.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…me API
### What changes were proposed in this pull request?
This PR migrates all existing proto tests to be DataFrame API based.
### Why are the changes needed?
1. The goal of the proto tests is to test the capability of representing DataFrames with the Connect proto, so comparing against the DataFrame API is more accurate.
2. Some Connect plan execution requires a SparkSession anyway. We can unify all tests into one suite by only using the DataFrame API (e.g. we can merge `SparkConnectDeduplicateSuite.scala` into `SparkConnectProtoSuite.scala`).
3. This also enables the possibility of testing results (not only plans) in the future.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT.
Closes #38406 from amaliujia/refactor_server_tests.
Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
This is a follow-up of #37671.
### Why are the changes needed?
Since #37671 added `openpyxl` to the PySpark test environments and re-enabled the `test_to_excel` test, we need to add it to `requirements.txt` as an explicit PySpark test dependency.
### Does this PR introduce _any_ user-facing change?
No. This is a test dependency.
### How was this patch tested?
Manually.
Closes #38425 from dongjoon-hyun/SPARK-40229.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Yikun Jiang <[email protected]>
…lass` by reusing the `SparkFunSuite#checkError`
### What changes were proposed in this pull request?
This PR aims to refactor the `AnalysisTest#assertAnalysisErrorClass` method by reusing the `checkError` method in `SparkFunSuite`.
Additionally, the signature of the `AnalysisTest#assertAnalysisErrorClass` method changes from
```
protected def assertAnalysisErrorClass(
    inputPlan: LogicalPlan,
    expectedErrorClass: String,
    expectedMessageParameters: Map[String, String],
    caseSensitive: Boolean = true,
    line: Int = -1,
    pos: Int = -1): Unit
```
to
```
protected def assertAnalysisErrorClass(
    inputPlan: LogicalPlan,
    expectedErrorClass: String,
    expectedMessageParameters: Map[String, String],
    queryContext: Array[QueryContext] = Array.empty,
    caseSensitive: Boolean = true): Unit
```
With this change, `queryContext` is used instead of `line + pos` for the assertion.
### Why are the changes needed?
`assertAnalysisErrorClass` and `checkError` do the same work.
### Does this PR introduce _any_ user-facing change?
No, just for test
### How was this patch tested?
- Pass GitHub Actions
Closes #38413 from LuciferYang/simplify-assertAnalysisErrorClass.
Authored-by: yangjie01 <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request?
This PR aims to replace 'intercept' with 'Check error classes' in PlanResolutionSuite.
### Why are the changes needed?
The changes improve the error framework.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "test:testOnly *PlanResolutionSuite"
```
Closes #38421 from panbingkun/SPARK-40889.
Authored-by: panbingkun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
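A hedged sketch of the replacement pattern described above, as it might appear inside the test suite; the `parseAndResolve` helper, error class, and parameters below are placeholders, not necessarily those used by PlanResolutionSuite:
```scala
val sqlText = "CREATE TABLE t (col INT) USING parquet PARTITIONED BY (bucket(4, col))"

// Before: only the exception type and a message substring were checked.
val e = intercept[AnalysisException] { parseAndResolve(sqlText) }
assert(e.getMessage.contains("not supported"))

// After: the error class and its message parameters are asserted explicitly
// via SparkFunSuite#checkError.
checkError(
  exception = intercept[AnalysisException] { parseAndResolve(sqlText) },
  errorClass = "UNSUPPORTED_FEATURE.EXAMPLE",  // placeholder error class
  parameters = Map("operation" -> "EXAMPLE"))  // placeholder parameters
```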
…rquetFileFormat on producing columnar output
### What changes were proposed in this pull request?
We move the decision about supporting columnar output based on WSCG one level up, from ParquetFileFormat / OrcFileFormat to FileSourceScanExec, and pass it as a new required option to ParquetFileFormat / OrcFileFormat. Now the semantics are as follows:
* `ParquetFileFormat.supportsBatch` and `OrcFileFormat.supportsBatch` return whether the format **can**, not necessarily **will**, return columnar output.
* To return columnar output, the option `FileFormat.OPTION_RETURNING_BATCH` needs to be passed to `buildReaderWithPartitionValues` in these two file formats. It should only be set to `true` if `supportsBatch` is also `true`, but it can be set to `false` if we don't want columnar output nevertheless. This way, `FileSourceScanExec` can set it to `false` when there are more than 100 columns for WSCG, and `ParquetFileFormat` / `OrcFileFormat` doesn't have to concern itself with WSCG limits (see the sketch after this list).
* To avoid not passing it by accident, this option is made required. Making it required requires updating a few places that use it, but an error resulting from this is very obscure. It's better to fail early and explicitly here.
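A hedged sketch of how a caller is expected to pass the new required option; it follows the shape of the snippets quoted later in this description and is not the PR's exact diff:
```scala
// In FileSourceScanExec (sketch): decide columnar support once, then tell the
// file format explicitly whether it should actually return columnar batches.
val readerOptions = relation.options +
  (FileFormat.OPTION_RETURNING_BATCH -> supportsColumnar.toString)

relation.fileFormat.buildReaderWithPartitionValues(
  sparkSession = relation.sparkSession,
  dataSchema = relation.dataSchema,
  partitionSchema = relation.partitionSchema,
  requiredSchema = requiredSchema,
  filters = pushedDownFilters,
  options = readerOptions,
  hadoopConf = hadoopConf)
```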
### Why are the changes needed?
This explains it for `ParquetFileFormat`. `OrcFileFormat` had exactly the same issue.
`java.lang.ClassCastException: org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to org.apache.spark.sql.catalyst.InternalRow` was being thrown because ParquetReader was outputting columnar batches, while FileSourceScanExec expected row output.
The mismatch comes from the fact that `ParquetFileFormat.supportBatch` depends on `WholeStageCodegenExec.isTooManyFields(conf, schema)`, where the threshold is 100 fields.
When this is used in `FileSourceScanExec`:
```
override lazy val supportsColumnar: Boolean = {
  relation.fileFormat.supportBatch(relation.sparkSession, schema)
}
```
the `schema` comes from output attributes, which includes extra metadata attributes.
However, inside `ParquetFileFormat.buildReaderWithPartitionValues` it was calculated again as
```
relation.fileFormat.buildReaderWithPartitionValues(
  sparkSession = relation.sparkSession,
  dataSchema = relation.dataSchema,
  partitionSchema = relation.partitionSchema,
  requiredSchema = requiredSchema,
  filters = pushedDownFilters,
  options = options,
  hadoopConf = hadoopConf
...
val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
...
val returningBatch = supportBatch(sparkSession, resultSchema)
```
Where `requiredSchema` and `partitionSchema` wouldn't include the metadata columns:
```
FileSourceScanExec: output: List(c1#4608L, c2#4609L, ..., c100#4707L, file_path#6388)
FileSourceScanExec: dataSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
FileSourceScanExec: partitionSchema: StructType()
FileSourceScanExec: requiredSchema: StructType(StructField(c1,LongType,true),StructField(c2,LongType,true),...,StructField(c100,LongType,true))
```
Columns like `file_path#6388` are added by the scan and contain metadata added by the scan, not by the file reader, which concerns itself only with what is inside the file.
### Does this PR introduce _any_ user-facing change?
Not a public API change, but it is now required to pass `FileFormat.OPTION_RETURNING_BATCH` in `options` to `ParquetFileFormat.buildReaderWithPartitionValues`. The only user of this API in Apache Spark is `FileSourceScanExec`.
### How was this patch tested?
Tests added
Closes #38397 from juliuszsompolski/SPARK-40918.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
The messages returned by allGather may be overwritten by subsequent barrier API calls, e.g.,
``` scala
val messages: Array[String] = context.allGather("ABC")
context.barrier()
```
The `messages` may come back as `Array("", "")`, while we expect `Array("ABC", "ABC")`.
The root cause is that the [messages got by allGather](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/BarrierTaskContext.scala#L102) point to the [original message](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/BarrierCoordinator.scala#L107) in local mode, so when a subsequent barrier API changes the message, the allGather messages change accordingly.
As a result, users can't get the correct result.
This PR fixed this issue by sending back the cloned messages.
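A hedged sketch of the behaviour guaranteed after the fix, in the spirit of the snippet above; it would run inside a barrier stage with two tasks, and the variable names are illustrative:
```scala
// Inside an RDD.barrier().mapPartitions { ... } block running with 2 tasks:
val context = org.apache.spark.BarrierTaskContext.get()
val messages: Array[String] = context.allGather("ABC")
context.barrier()
// With the cloned reply, later barrier calls no longer clobber the result.
assert(messages.sameElements(Array("ABC", "ABC")))
```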
### Why are the changes needed?
The bug described here may block external Spark ML libraries which heavily depend on the Spark barrier API for synchronization. If the barrier mechanism can't guarantee the correctness of the barrier APIs, it will be a disaster for those libraries.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
I added a unit test; with this PR, the unit test passes.
Closes #38410 from wbo4958/allgather-issue.
Authored-by: Bobby Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ed if `pandas` doesn't exist
### What changes were proposed in this pull request?
This PR aims to skip `pyspark-connect` unit tests when `pandas` is unavailable.
### Why are the changes needed?
**BEFORE**
```
% python/run-tests --modules pyspark-connect
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['python3.9']
Will test the following Python modules: ['pyspark-connect']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.15
Starting test(python3.9): pyspark.sql.tests.connect.test_connect_plan_only (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/f14573f1-131f-494a-a015-8b4762219fb5/python3.9__pyspark.sql.tests.connect.test_connect_plan_only__86sd4pxg.log)
Starting test(python3.9): pyspark.sql.tests.connect.test_connect_column_expressions (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/51391499-d21a-4c1d-8b79-6ac52859a4c9/python3.9__pyspark.sql.tests.connect.test_connect_column_expressions__kn__9aur.log)
Starting test(python3.9): pyspark.sql.tests.connect.test_connect_basic (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/7854cbef-e40d-4090-a37d-5a5314eb245f/python3.9__pyspark.sql.tests.connect.test_connect_basic__i1rutevd.log)
Starting test(python3.9): pyspark.sql.tests.connect.test_connect_select_ops (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/6f947453-7481-4891-81b0-169aaac8c6ee/python3.9__pyspark.sql.tests.connect.test_connect_select_ops__5sxao0ji.log)
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python3.9/3.9.15/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/homebrew/Cellar/python3.9/3.9.15/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/sql/tests/connect/test_connect_basic.py", line 22, in <module>
import pandas
ModuleNotFoundError: No module named 'pandas'
```
**AFTER**
```
% python/run-tests --modules pyspark-connect
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['python3.9']
Will test the following Python modules: ['pyspark-connect']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.15
Starting test(python3.9): pyspark.sql.tests.connect.test_connect_basic (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/571609c0-3070-476c-afbe-56e215eb5647/python3.9__pyspark.sql.tests.connect.test_connect_basic__4e9k__5x.log)
Starting test(python3.9): pyspark.sql.tests.connect.test_connect_column_expressions (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/4a30d035-e392-4ad2-ac10-5d8bc5421321/python3.9__pyspark.sql.tests.connect.test_connect_column_expressions__c9x39tvp.log)
Starting test(python3.9): pyspark.sql.tests.connect.test_connect_plan_only (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/eea0b5db-9a92-4fbb-912d-a59daaf73f8e/python3.9__pyspark.sql.tests.connect.test_connect_plan_only__0p9ivnod.log)
Starting test(python3.9): pyspark.sql.tests.connect.test_connect_select_ops (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/6069c664-afd9-4a3c-a0cc-f707577e039e/python3.9__pyspark.sql.tests.connect.test_connect_select_ops__sxzrtiqa.log)
Finished test(python3.9): pyspark.sql.tests.connect.test_connect_column_expressions (1s) ... 2 tests were skipped
Finished test(python3.9): pyspark.sql.tests.connect.test_connect_select_ops (1s) ... 2 tests were skipped
Finished test(python3.9): pyspark.sql.tests.connect.test_connect_plan_only (1s) ... 10 tests were skipped
Finished test(python3.9): pyspark.sql.tests.connect.test_connect_basic (1s) ... 6 tests were skipped
Tests passed in 1 seconds
Skipped tests in pyspark.sql.tests.connect.test_connect_basic with python3.9:
test_limit_offset (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.002s)
test_schema (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.000s)
test_simple_datasource_read (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.000s)
test_simple_explain_string (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.000s)
test_simple_read (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.000s)
test_simple_udf (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests) ... skip (0.000s)
Skipped tests in pyspark.sql.tests.connect.test_connect_column_expressions with python3.9:
test_column_literals (pyspark.sql.tests.connect.test_connect_column_expressions.SparkConnectColumnExpressionSuite) ... skip (0.000s)
test_simple_column_expressions (pyspark.sql.tests.connect.test_connect_column_expressions.SparkConnectColumnExpressionSuite) ... skip (0.000s)
Skipped tests in pyspark.sql.tests.connect.test_connect_plan_only with python3.9:
test_all_the_plans (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.002s)
test_datasource_read (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s)
test_deduplicate (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.001s)
test_filter (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s)
test_limit (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s)
test_offset (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s)
test_relation_alias (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s)
test_sample (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.001s)
test_simple_project (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s)
test_simple_udf (pyspark.sql.tests.connect.test_connect_plan_only.SparkConnectTestsPlanOnly) ... skip (0.000s)
Skipped tests in pyspark.sql.tests.connect.test_connect_select_ops with python3.9:
test_join_with_join_type (pyspark.sql.tests.connect.test_connect_select_ops.SparkConnectToProtoSuite) ... skip (0.002s)
test_select_with_columns_and_strings (pyspark.sql.tests.connect.test_connect_select_ops.SparkConnectToProtoSuite) ... skip (0.000s)
```
### Does this PR introduce _any_ user-facing change?
No. This is a test-only PR.
### How was this patch tested?
Manually run the following.
```
$ pip3 uninstall pandas
$ python/run-tests --modules pyspark-connect
```
Closes #38426 from dongjoon-hyun/SPARK-40951.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
pull bot pushed a commit that referenced this pull request on Nov 22, 2024:
…ead pool
### What changes were proposed in this pull request?
This PR aims to use a meaningful class name prefix for the REST Submission API thread pool instead of the default value of Jetty QueuedThreadPool, `"qtp"+super.hashCode()`.
https://github.com/dekellum/jetty/blob/3dc0120d573816de7d6a83e2d6a97035288bdd4a/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L64
### Why are the changes needed?
This is helpful during JVM investigation.
**BEFORE (4.0.0-preview2)**
```
$ SPARK_MASTER_OPTS='-Dspark.master.rest.enabled=true' sbin/start-master.sh
$ jstack 28217 | grep qtp
"qtp1925630411-52" #52 daemon prio=5 os_prio=31 cpu=0.07ms elapsed=19.06s tid=0x0000000134906c10 nid=0xde03 runnable [0x0000000314592000]
"qtp1925630411-53" #53 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=19.06s tid=0x0000000134ac6810 nid=0xc603 runnable [0x000000031479e000]
"qtp1925630411-54" #54 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=19.06s tid=0x000000013491ae10 nid=0xdc03 runnable [0x00000003149aa000]
"qtp1925630411-55" #55 daemon prio=5 os_prio=31 cpu=0.08ms elapsed=19.06s tid=0x0000000134ac9810 nid=0xc803 runnable [0x0000000314bb6000]
"qtp1925630411-56" #56 daemon prio=5 os_prio=31 cpu=0.04ms elapsed=19.06s tid=0x0000000134ac9e10 nid=0xda03 runnable [0x0000000314dc2000]
"qtp1925630411-57" #57 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=19.06s tid=0x0000000134aca410 nid=0xca03 runnable [0x0000000314fce000]
"qtp1925630411-58" #58 daemon prio=5 os_prio=31 cpu=0.04ms elapsed=19.06s tid=0x0000000134acaa10 nid=0xcb03 runnable [0x00000003151da000]
"qtp1925630411-59" #59 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=19.06s tid=0x0000000134acb010 nid=0xcc03 runnable [0x00000003153e6000]
"qtp1925630411-60-acceptor-0108e9815-ServerConnector1e497474{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #60 daemon prio=3 os_prio=31 cpu=0.11ms elapsed=19.06s tid=0x00000001317ffa10 nid=0xcd03 runnable [0x00000003155f2000]
"qtp1925630411-61-acceptor-11d90f2aa-ServerConnector1e497474{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #61 daemon prio=3 os_prio=31 cpu=0.10ms elapsed=19.06s tid=0x00000001314ed610 nid=0xcf03 waiting on condition [0x00000003157fe000]
```
**AFTER**
```
$ SPARK_MASTER_OPTS='-Dspark.master.rest.enabled=true' sbin/start-master.sh
$ jstack 28317 | grep StandaloneRestServer
"StandaloneRestServer-52" #52 daemon prio=5 os_prio=31 cpu=0.09ms elapsed=60.06s tid=0x00000001284a8e10 nid=0xdb03 runnable [0x000000032cfce000]
"StandaloneRestServer-53" #53 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284acc10 nid=0xda03 runnable [0x000000032d1da000]
"StandaloneRestServer-54" #54 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284ae610 nid=0xd803 runnable [0x000000032d3e6000]
"StandaloneRestServer-55" #55 daemon prio=5 os_prio=31 cpu=0.09ms elapsed=60.06s tid=0x00000001284aec10 nid=0xd703 runnable [0x000000032d5f2000]
"StandaloneRestServer-56" #56 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284af210 nid=0xc803 runnable [0x000000032d7fe000]
"StandaloneRestServer-57" #57 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284af810 nid=0xc903 runnable [0x000000032da0a000]
"StandaloneRestServer-58" #58 daemon prio=5 os_prio=31 cpu=0.06ms elapsed=60.06s tid=0x00000001284afe10 nid=0xcb03 runnable [0x000000032dc16000]
"StandaloneRestServer-59" #59 daemon prio=5 os_prio=31 cpu=0.05ms elapsed=60.06s tid=0x00000001284b0410 nid=0xcc03 runnable [0x000000032de22000]
"StandaloneRestServer-60-acceptor-04aefbaa8-ServerConnector44284d85{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #60 daemon prio=3 os_prio=31 cpu=0.13ms elapsed=60.05s tid=0x000000015cda1a10 nid=0xcd03 runnable [0x000000032e02e000]
"StandaloneRestServer-61-acceptor-148976251-ServerConnector44284d85{HTTP/1.1, (http/1.1)}{M3-Max.local:6066}" #61 daemon prio=3 os_prio=31 cpu=0.12ms elapsed=60.05s tid=0x000000015cd1c810 nid=0xce03 waiting on condition [0x000000032e23a000]
```
### Does this PR introduce _any_ user-facing change?
No, the thread names are only observed during debugging.
### How was this patch tested?
Manual review.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#48924 from dongjoon-hyun/SPARK-50385.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: panbingkun <[email protected]>
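For reference, a minimal standalone sketch of giving a Jetty `QueuedThreadPool` a meaningful name, as the change above describes; this is not the PR's actual diff, and the name is taken from the jstack output shown above:
```scala
import org.eclipse.jetty.util.thread.QueuedThreadPool

// With a named pool, threads show up in jstack as "StandaloneRestServer-<n>"
// instead of the default "qtp<hashCode>-<n>".
val threadPool = new QueuedThreadPool()
threadPool.setName("StandaloneRestServer")
```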