[WIP] Add metadata to MapStatus #4

attilapiros · 2022-01-21T18:40:50Z

For storing and retrieving MapOutputMetadata, my idea was to add the metadata directly to MapStatus and delegate its serialization/deserialization to a new class, MapOutputMetadataExternalizer, which is constructed by the ShuffleManager.

This way Uber RSS could fill the location with the BlockManager ID of the executor where the map task ran, and store the RSS-related block coordinates as a custom MapOutputMetadata.

Advantage:

With this approach, a single shuffle solution can handle different kinds of MapOutputMetadata: MapOutputMetadataExternalizer#writeExternal could write a type indicator first (for example a single Byte, depending on the MapOutputMetadata type), and readExternal could create the right instance based on the indicator it reads.
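A minimal sketch of the type-indicator idea follows. It is not the code from this PR: the metadata classes, the tag values, and the `TypeTaggedExternalizer` object are hypothetical names chosen for illustration, and only `java.io` classes are used.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInput, ObjectInputStream, ObjectOutput, ObjectOutputStream}

object MetadataExternalizerSketch {
  // Hypothetical metadata hierarchy; these types are assumptions, not Spark classes.
  sealed trait MapOutputMetadata
  final case class RssBlockCoordinates(server: String, partitionFile: Long) extends MapOutputMetadata
  final case class LocalDiskMetadata(checksum: Long) extends MapOutputMetadata

  // Hypothetical externalizer: one byte identifies the concrete type, the payload follows.
  object TypeTaggedExternalizer {
    private val RssTag: Byte = 1
    private val LocalTag: Byte = 2

    def writeExternal(meta: MapOutputMetadata, out: ObjectOutput): Unit = meta match {
      case RssBlockCoordinates(server, partitionFile) =>
        out.writeByte(RssTag)
        out.writeUTF(server)
        out.writeLong(partitionFile)
      case LocalDiskMetadata(checksum) =>
        out.writeByte(LocalTag)
        out.writeLong(checksum)
    }

    def readExternal(in: ObjectInput): MapOutputMetadata = in.readByte() match {
      // Fields are read back in the same order they were written.
      case RssTag => RssBlockCoordinates(in.readUTF(), in.readLong())
      case LocalTag => LocalDiskMetadata(in.readLong())
      case other => throw new IllegalStateException(s"Unknown metadata type indicator: $other")
    }
  }

  // Serializes and deserializes one value through in-memory streams.
  def roundTrip(meta: MapOutputMetadata): MapOutputMetadata = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    TypeTaggedExternalizer.writeExternal(meta, out)
    out.close() // flushes the buffered block data into `bytes`
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try TypeTaggedExternalizer.readExternal(in) finally in.close()
  }
}
```

In this sketch the reader never needs to know in advance which metadata type was written; the leading byte decides which constructor is called.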

Disadvantage:

On retrieval I had to bind the MapStatus location and the MapOutputMetadata together:

spark/core/src/main/scala/org/apache/spark/MapOutputTracker.scala

Lines 1651 to 1654 in 72f3af3

    
    val blockManagerIdWithMeta = BlockManagerIdWithMeta(status.location, status.metadata)
    splitsByAddress
      .getOrElseUpdate(blockManagerIdWithMeta, ListBuffer()) +=
      ((ShuffleBlockId(shuffleId, status.mapId, part), size, mapIndex))

This feels bad. One alternative is to use only the MapOutputMetadata and ignore the locations in this kind of retrieval...

But if we need both the location and the MapOutputMetadata, then a much better solution would be apache#31876, which is a stale PR, but we can help with that.
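To make the binding of location and metadata concrete, here is a self-contained sketch of the grouping step. The case classes are simplified stand-ins for Spark's types (their fields are assumptions), and `group` is a hypothetical helper, not a method of MapOutputTracker.

```scala
import scala.collection.mutable

object SplitsByAddressSketch {
  // Simplified, hypothetical stand-ins for the Spark types used in the snippet above.
  final case class BlockManagerId(executorId: String, host: String, port: Int)
  final case class Metadata(payload: String)
  final case class BlockManagerIdWithMeta(location: BlockManagerId, metadata: Option[Metadata])
  final case class ShuffleBlockId(shuffleId: Int, mapId: Long, reduceId: Int)
  final case class Status(location: BlockManagerId, metadata: Option[Metadata], mapId: Long, size: Long)

  // Groups the blocks of one reduce partition by the combined (location, metadata) key.
  def group(shuffleId: Int, part: Int, statuses: Seq[Status])
      : Map[BlockManagerIdWithMeta, List[(ShuffleBlockId, Long, Int)]] = {
    val splitsByAddress = mutable.LinkedHashMap
      .empty[BlockManagerIdWithMeta, mutable.ListBuffer[(ShuffleBlockId, Long, Int)]]
    for ((status, mapIndex) <- statuses.zipWithIndex) {
      val key = BlockManagerIdWithMeta(status.location, status.metadata)
      splitsByAddress.getOrElseUpdate(key, mutable.ListBuffer.empty) +=
        ((ShuffleBlockId(shuffleId, status.mapId, part), status.size, mapIndex))
    }
    splitsByAddress.map { case (key, blocks) => key -> blocks.toList }.toMap
  }
}
```

One property of this sketch is that two map outputs from the same executor end up in separate groups whenever their metadata differs, because the metadata is part of the key.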

hiboyang · 2022-01-23T16:34:01Z

Thanks for the change! Adding a metadata field to MapStatus looks great! I just wonder whether there is a better way to serialize/deserialize the metadata than depending on ShuffleManager. Is it possible to add serialize/deserialize methods into the metadata class itself?

attilapiros · 2022-01-24T10:15:01Z

> Is it possible to add serialize/deserialize methods into the metadata class itself?

We can do that, but instantiating the right metadata type would still have to depend on the shuffle manager implementation.

Let me look into that.
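One possible shape for that split, sketched under assumptions: the metadata writes itself, while a factory supplied by the shuffle manager decides which concrete type to instantiate when reading. The trait name `MapOutputMetadataFactory` echoes a later commit title in this PR, but the signatures, `RssMetadata`, and `ShuffleManagerLike` are hypothetical.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInput, DataInputStream, DataOutput, DataOutputStream}

object SelfSerializingMetadataSketch {
  // The metadata class owns its serialization (assumed interface).
  trait MapOutputMetadata {
    def write(out: DataOutput): Unit
  }

  // Reading still needs something that knows which concrete type to build (assumed interface).
  trait MapOutputMetadataFactory {
    def read(in: DataInput): MapOutputMetadata
  }

  // Hypothetical RSS metadata carrying a single server address.
  final case class RssMetadata(server: String) extends MapOutputMetadata {
    override def write(out: DataOutput): Unit = out.writeUTF(server)
  }

  object RssMetadataFactory extends MapOutputMetadataFactory {
    override def read(in: DataInput): MapOutputMetadata = RssMetadata(in.readUTF())
  }

  // Stand-in for a shuffle manager that chooses the factory for its own metadata type.
  trait ShuffleManagerLike {
    def metadataFactory: MapOutputMetadataFactory
  }

  object RssShuffleManager extends ShuffleManagerLike {
    override val metadataFactory: MapOutputMetadataFactory = RssMetadataFactory
  }

  // Writes with the metadata's own method, reads back through the manager's factory.
  def roundTrip(meta: MapOutputMetadata, manager: ShuffleManagerLike): MapOutputMetadata = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    meta.write(out)
    out.close()
    manager.metadataFactory.read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
  }
}
```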

github-actions · 2022-10-05T00:46:03Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

### What changes were proposed in this pull request?

This PR introduces sasl retry count in RetryingBlockTransferor.

### Why are the changes needed?

Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario:

1. SaslTimeoutException
2. IOException
3. SaslTimeoutException
4. IOException

Even though IOException at #2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step #4. Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New test is added, courtesy of Mridul.

Closes apache#39611 from tedyu/sasl-cnt.

Authored-by: Ted Yu <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

…edExpression()

### What changes were proposed in this pull request?

In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`.

### Why are the changes needed?

This fixes a regression caused by apache#39010 which added the `supportedExpression()` to `addExprTree()` and `getExprState()` but not `addExpr()`.

One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error).

### Does this PR introduce _any_ user-facing change?

This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions.

Example running the SQL command:

```sql
select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)
```

example error message before the fix:

```
java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3]
```

after the fix this error is gone.

### How was this patch tested?

Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of user-visible symptom.

Closes apache#40473 from rednaxelafx/spark-42851.

Authored-by: Kris Mok <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

… throw internal error

### What changes were proposed in this pull request?

This PR fixes the error messages and classes when Python UDFs are used in higher order functions.

### Why are the changes needed?

To show the proper user-facing exceptions with error classes.

### Does this PR introduce _any_ user-facing change?

Yes, previously it threw internal error such as:

```python
from pyspark.sql.functions import transform, udf, col, array
spark.range(1).select(transform(array("id"), lambda x: udf(lambda y: y)(x))).collect()
```

Before:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o74.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 0.0 failed 1 times, most recent failure: Lost task 15.0 in stage 0.0 (TID 15) (ip-192-168-123-103.ap-northeast-2.compute.internal executor driver): org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: <lambda>(lambda x_0#3L)#2 SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
```

After:

```
pyspark.errors.exceptions.captured.AnalysisException: [INVALID_LAMBDA_FUNCTION_CALL.UNEVALUABLE] Invalid lambda function call. Python UDFs should be used in a lambda function at a higher order function. However, "<lambda>(lambda x_0#3L)" was a Python UDF. SQLSTATE: 42K0D;
Project [transform(array(id#0L), lambdafunction(<lambda>(lambda x_0#3L)#2, lambda x_0#3L, false)) AS transform(array(id), lambdafunction(<lambda>(lambda x_0#3L), namedlambdavariable()))#4]
+- Range (0, 1, step=1, splits=Some(16))
```

### How was this patch tested?

Unittest was added

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47079 from HyukjinKwon/SPARK-48706.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Kent Yao <[email protected]>

add metadata to MapStatus

72f3af3

github-actions bot added the CORE label Jan 21, 2022

attilapiros mentioned this pull request Jan 21, 2022

[SPARK-37394][CORE] Skip registering with external shuffle server if a customized shuffle manager is configured apache/spark#34672

Closed

MapOutputMetadataExternalizer -> MapOutputMetadataFactory

14d6ab3

attilapiros mentioned this pull request Jan 24, 2022

[SPARK-34942][API][CORE] Abstract Location in MapStatus to enable support for custom storage apache/spark#31876

Closed

github-actions bot added the Stale label Oct 5, 2022

github-actions bot closed this Oct 6, 2022
