[SPARK-48589][SQL][SS] Add option snapshotStartBatchId and snapshotPartitionId to state data source #46944
Conversation
…-liu/spark into skipSnapshotAtBatch
Is it necessary to add an end-to-end test for the options? If so, I can create another PR. The way to construct it is probably to sleep long enough for the maintenance task to run. @anishshri-db @HeartSaVioR
throw QueryExecutionErrors.failedToReadSnapshotFileNotExistsError(
  snapshotFile(startVersion), toString(), null)
}
synchronized { putStateIntoStateCacheMap(startVersion, startVersionMap.get) }
is it possible to refactor this with existing loadMap fcn? or add helper function for shared logic
For HDFS, it is hard because the common part is really small. But for RocksDB, there is room for refactoring. For example, this PR tests whether we can extract a common part of both `load` functions: #46927
 * @param endVersion checkpoint version to end with
 */
def getStore(startVersion: Long, endVersion: Long): StateStore =
  throw new SparkUnsupportedOperationException("getStore with startVersion and endVersion " +
can we just put nothing here? like
def getStore(version: Long): StateStore
It seems that we cannot, because to make this method optional, it has to have a default implementation, otherwise a build error will be thrown.
Hmm - what error do you see here? Can you paste it please?
Assuming users create a custom state store provider and do not implement this method (because it is optional), they will see errors like:
`Missing implementation for member of trait StateStoreProvider`
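A self-contained Scala sketch of the constraint discussed above (simplified names; the real code uses `StateStoreProvider` and `SparkUnsupportedOperationException`): making the method optional requires a default body, and throwing from that default defers the failure to providers that are actually asked for a version range.

```scala
// Minimal sketch, not Spark's actual trait: an optional trait method needs a
// default body, otherwise every existing implementor fails to compile with
// "Missing implementation for member of trait ...".
trait ProviderSketch {
  // Mandatory: no default body, so implementors must define it.
  def getStore(version: Long): String

  // Optional: the default body throws, so only providers that are actually
  // asked to replay a version range need to override it.
  def getStore(startVersion: Long, endVersion: Long): String =
    throw new UnsupportedOperationException(
      s"getStore with startVersion and endVersion is not supported by ${getClass.getName}")
}

class MinimalProvider extends ProviderSketch {
  override def getStore(version: Long): String = s"store@$version"
  // No override of the two-argument getStore: this still compiles.
}
```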
import org.apache.spark.sql.execution.datasources.v2.state.utils.SchemaUtil
import org.apache.spark.sql.execution.streaming.{CommitLog, MemoryStream, OffsetSeqLog}
import org.apache.spark.sql.execution.streaming.state.{HDFSBackedStateStoreProvider, RocksDBStateStoreProvider, StateStore}
import org.apache.spark.sql.execution.streaming.state._
is this because these three are everything in that pkg?
No. The reason is that I use three new classes from this package; I think the import line would be too long if it listed them all. What do you think?
Yea this should be good
@WweiL
…-liu/spark into skipSnapshotAtBatch
@@ -796,4 +973,141 @@ abstract class StateDataSourceReadSuite extends StateDataSourceTestBase with Ass
    testForSide("right")
  }
}

protected def testSnapshotNotFound(): Unit = {
  withTempDir(tempDir => {
nit: according to the Databricks Scala style, this should be `withTempDir { tempDir =>`; it could save one indentation (curly brace)
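A quick illustration of the two styles, with a hypothetical stand-in for the suite's `withTempDir` helper so the snippet runs on its own:

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical stand-in for the test helper referenced above.
def withTempDir(f: File => Unit): Unit = {
  val dir = Files.createTempDirectory("demo").toFile
  try f(dir) finally dir.delete()
}

// Style the nit flags: parentheses wrapping a lambda that opens a block,
// which costs an extra indentation level.
withTempDir(tempDir => {
  println(tempDir.getAbsolutePath)
})

// Preferred style: pass the function as a brace block.
withTempDir { tempDir =>
  println(tempDir.getAbsolutePath)
}
```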
provider.asInstanceOf[SupportsFineGrainedReplayFromSnapshot]
  .replayReadStateFromSnapshot(1, 2)
}
checkError(exc, "CANNOT_LOAD_STATE_STORE.UNCATEGORIZED")
It would be nice if we could give users a better error message, e.g. "snapshot file does not exist", but I'm OK with addressing this later.
Let's address it later, along with the changelog-file-not-found exception.
}

protected def testGetReadStoreWithStartVersion(): Unit = {
  withTempDir(tempDir => {
ditto
}

protected def testSnapshotPartitionId(): Unit = {
  withTempDir(tempDir => {
ditto
.option(StateSourceOptions.SNAPSHOT_START_BATCH_ID, 0)
.option(
  StateSourceOptions.SNAPSHOT_PARTITION_ID,
  spark.sessionState.conf.numShufflePartitions)
It just needs to be > 0.
I see, it is because of the limit operator.
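To make the range discussion concrete, a hypothetical read with an out-of-range partition id (option names are from this PR; the path is made up). Valid ids run from 0 to the state's partition count minus one, and per the exchange above, limit state keeps a single partition, so any id > 0 already fails; `numShufflePartitions` is just one safely out-of-range choice.

```scala
// Illustrative sketch: this read should fail because snapshotPartitionId is
// outside the valid range 0 .. numPartitions - 1 for single-partition state.
val badRead = spark.read
  .format("statestore")
  .option("snapshotStartBatchId", 0)
  .option("snapshotPartitionId", 1) // any id > 0 is out of range here
  .load("/tmp/streaming-query/checkpoint") // hypothetical checkpoint root
```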
})
}

// TODO: Should also test against state generated by 3.5
Is this a remaining TODO, or does it not need to be done at all? If we don't need it, let's remove the golden files for 3.5. I guess it's not intentional to test cross-version compatibility.
checkAnswer(stateSnapshotDf, stateDf)
}

protected def testSnapshotOnLimitState(providerName: String): Unit = {
General comment for tests using golden files: please leave the code you used to build the golden file (the query you ran) as a comment or similar, so that others can rebuild the golden file if needed.
}

/**
 * Consturct the state at endVersion from snapshot from snapshotVersion.
nit: Construct the state at
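As an aside, the doc comment above describes the snapshot-replay idea; here is a minimal, self-contained sketch of that reconstruction, with in-memory maps standing in for snapshot and delta files (this is not Spark's implementation):

```scala
// Toy model of replay: start from the snapshot at snapshotVersion, then fold
// in the delta for each version up to endVersion. A delta maps keys to
// Some(newValue) for puts and None for removals.
object ReplaySketch {
  type State = Map[String, Int]

  def replay(
      snapshot: State,
      snapshotVersion: Long,
      endVersion: Long,
      deltaFor: Long => Seq[(String, Option[Int])]): State = {
    require(snapshotVersion <= endVersion, "endVersion cannot precede snapshotVersion")
    (snapshotVersion + 1 to endVersion).foldLeft(snapshot) { (state, version) =>
      deltaFor(version).foldLeft(state) {
        case (s, (key, Some(value))) => s.updated(key, value)
        case (s, (key, None))        => s - key
      }
    }
  }
}
```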
if (!condition) { throw new IllegalStateException(msg) }
}

override def replayStateFromSnapshot(snapshotVersion: Long, endVersion: Long): StateStore = {
Can you add a small function comment here?
errorClass = "CANNOT_LOAD_STATE_STORE.CANNOT_READ_MISSING_SNAPSHOT_FILE",
messageParameters = Map(
  "fileToRead" -> fileToRead,
  "clazz" -> clazz))
Is this a common convention for the parameter naming? This will be visible in the error message that is thrown, correct?
It seems so. The parameter names themselves will not appear. I learned from here: https://github.com/apache/spark/blob/6bfeb094248269920df8b107c86f0982404935cd/common/utils/src/main/resources/error/error-conditions.json#L236C54-L236C59
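For context, a sketch of what such an entry in `error-conditions.json` might look like; this is illustrative, not the actual entry (see the link above for the real definitions). Parameter names only appear as `<fileToRead>`-style placeholders inside the message template, and users see the substituted values:

```json
"CANNOT_LOAD_STATE_STORE": {
  "message": ["An error occurred during loading state."],
  "subClass": {
    "CANNOT_READ_MISSING_SNAPSHOT_FILE": {
      "message": [
        "Error reading snapshot file <fileToRead> of <clazz>: <fileToRead> does not exist."
      ]
    }
  }
}
```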
protected def testSnapshotOnDeduplicateState(providerName: String): Unit = {
  /** The golden files are generated by:
  withSQLConf({
nit: indent seems odd in these places, but maybe not a big deal for such comments
Will move one tab right.
}
*/
val resourceUri = this.getClass.getResource(
  s"/structured-streaming/checkpoint-version-4.0.0/$providerName/limit/"
I thought we were going to run against 3.5.1 and then run the query once to generate the operator metadata. Did we decide against that?
Strictly speaking, the test that a checkpoint with no operator metadata can create operator metadata should have been done in state metadata testing. If we don't have one, we'd better add one, but there is no need to couple it with this PR.
lgtm - pending some minor comments
Only nits and minors. Thanks for the patience!
  val newMap = replayLoadedMapForStoreFromSnapshot(snapshotVersion, endVersion)
- logInfo(log"Retrieved version ${MDC(LogKeys.STATE_STORE_VERSION, snapshotVersion)} to " +
+ logInfo(log"Retrieved snapshot at version " +
+   log"${MDC(LogKeys.STATE_STORE_VERSION, snapshotVersion)} and apply delta files to version" +
nit: space after `version`, as the next string does not start with a space.
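A tiny illustration of why the trailing space matters, using plain string concatenation in place of the structured log interpolator: `+` inserts nothing between the pieces.

```scala
val endVersion = 10
// Without the trailing space, the word glues onto the interpolated value:
val wrong = "apply delta files to version" + endVersion  // "...to version10"
val right = "apply delta files to version " + endVersion // "...to version 10"
```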
  val newMap = replayLoadedMapForStoreFromSnapshot(snapshotVersion, endVersion)
- logInfo(log"Retrieved version ${MDC(LogKeys.STATE_STORE_VERSION, snapshotVersion)} to " +
+ logInfo(log"Retrieved snapshot at version " +
+   log"${MDC(LogKeys.STATE_STORE_VERSION, snapshotVersion)} and apply delta files to version" +
nit: same here
messageParameters = Map("errorMsg" -> errorMsg))

class StateStoreProviderDoesNotSupportFineGrainedReplay(inputClass: String)
  extends SparkUnsupportedOperationException(
before `e`, it's only one space.
Thanks for all the careful checks by @HeartSaVioR @anishshri-db @WweiL. This PR is ready to merge.
+1
Thanks! Merging to master.
…to State Data Source

### What changes were proposed in this pull request?
In #46944 and #47188, we introduced some new options to the State Data Source. This PR aims to explain these new features in the documentation.

### Why are the changes needed?
It is necessary to reflect the latest change in the documentation website.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The API Doc website can be rendered correctly.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47274 from eason-yuchen-liu/snapshot-doc.

Authored-by: Yuchen Liu <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
…rtitionId to state data source

### What changes were proposed in this pull request?
This PR defines two new options, snapshotStartBatchId and snapshotPartitionId, for the existing state reader. Both of them should be provided at the same time.
1. When there is no snapshot file at `snapshotStartBatch` (note there is an off-by-one issue between version and batch Id), throw an exception.
2. Otherwise, the reader should continue to rebuild the state by reading delta files only, and ignore all snapshot files afterwards.
3. Note that if a `batchId` option is already specified, that batchId is the ending batchId; we should then end at that batchId.
4. This feature supports state generated by HDFS state store provider and RocksDB state store provider with changelog checkpointing enabled. **It does not support RocksDB with changelog disabled which is the default for RocksDB.**

### Why are the changes needed?
Sometimes when a snapshot is corrupted, users want to bypass it when reading a later state. This PR gives users the ability to specify the starting snapshot version and partition. This feature can be useful for debugging purposes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Created test cases for testing edge cases for the input of the new options. Created a test for the new public function `replayReadStateFromSnapshot`. Created integration tests for the new options against four stateful operators: limit, aggregation, deduplication, stream-stream join. Instead of generating states within the tests, which is unstable, I prepared golden files for the integration tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46944 from eason-yuchen-liu/skipSnapshotAtBatch.

Lead-authored-by: Yuchen Liu <[email protected]>
Co-authored-by: Yuchen Liu <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
…ly snapshotStartBatchId option)

### What changes were proposed in this pull request?
This PR enables StateDataSource support with state checkpoint v2 format for the `snapshotStartBatchId` and related options, completing the StateDataSource checkpoint v2 integration. There are changes to the replayStateFromSnapshot method signature: it gains `snapshotVersionStateStoreCkptId` and `endVersionStateStoreCkptId`. Both are needed, as `snapshotVersionStateStoreCkptId` is used when getting the snapshot and `endVersionStateStoreCkptId` for calculating the full lineage from the final version.

Before
```
def replayStateFromSnapshot(
    snapshotVersion: Long,
    endVersion: Long,
    readOnly: Boolean = false): StateStore
```

After
```
def replayStateFromSnapshot(
    snapshotVersion: Long,
    endVersion: Long,
    readOnly: Boolean = false,
    snapshotVersionStateStoreCkptId: Option[String] = None,
    endVersionStateStoreCkptId: Option[String] = None): StateStore
```

This is the final PR in the series following:
- #52047: Enable StateDataSource with state checkpoint v2 (only batchId option)
- #52148: Enable StateDataSource with state checkpoint v2 (only readChangeFeed)

NOTE: To read checkpoint v2 state data sources it is required to have `"spark.sql.streaming.stateStore.checkpointFormatVersion" -> 2`. It is possible to allow reading state data sources arbitrarily based on what is in the CommitLog by relaxing assertion checks, but this is left as a future change.

### Why are the changes needed?
State checkpoint v2 (`"spark.sql.streaming.stateStore.checkpointFormatVersion"`) introduces a new format for storing state metadata that includes unique identifiers in the file path for each state store. The existing StateDataSource implementation only worked with checkpoint v1 format, making it incompatible with streaming queries using the newer checkpoint format. Only `batchId` was implemented in #52047 and only `readChangeFeed` was implemented in #52148.

### Does this PR introduce _any_ user-facing change?
Yes. State Data Source will work when checkpoint v2 is used and the `snapshotStartBatchId` and related options are used.

### How was this patch tested?
In the previous PRs test suites were added to parameterize the current tests with checkpoint v2. All of these tests are now added back. All tests that previously intentionally tested some feature of the State Data Source Reader with checkpoint v1 should now be parameterized with checkpoint v2 (including python tests). `RocksDBWithCheckpointV2StateDataSourceReaderSnapshotSuite` is added which uses the golden file approach similar to #46944 where `snapshotStartBatchId` is first added.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #52202 from dylanwong250/SPARK-53332.

Authored-by: Dylan Wong <[email protected]>
Signed-off-by: Anish Shrigondekar <[email protected]>
What changes were proposed in this pull request?
This PR defines two new options, snapshotStartBatchId and snapshotPartitionId, for the existing state reader. Both of them should be provided at the same time.
1. When there is no snapshot file at `snapshotStartBatch` (note there is an off-by-one issue between version and batch Id), throw an exception.
2. Otherwise, the reader should continue to rebuild the state by reading delta files only, and ignore all snapshot files afterwards.
3. Note that if a `batchId` option is already specified, that batchId is the ending batchId; we should then end at that batchId.
4. This feature supports state generated by the HDFS state store provider and the RocksDB state store provider with changelog checkpointing enabled. It does not support RocksDB with changelog disabled, which is the default for RocksDB.

Why are the changes needed?
Sometimes when a snapshot is corrupted, users want to bypass it when reading a later state. This PR gives users the ability to specify the starting snapshot version and partition. This feature can be useful for debugging purposes.
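For instance, a hypothetical read that bypasses a corrupted snapshot (option names come from this PR; the path and ids are illustrative): rebuild partition 3 of the state starting from the snapshot for batch 10, applying only delta files from there on, and stop at batch 12.

```scala
// Illustrative sketch of the new options; values and path are made up.
val recovered = spark.read
  .format("statestore")
  .option("snapshotStartBatchId", 10) // skip the corrupted snapshot before batch 10
  .option("snapshotPartitionId", 3)   // rebuild only state partition 3
  .option("batchId", 12)              // existing option: the ending batch
  .load("/tmp/streaming-query/checkpoint")
```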
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Created test cases for testing edge cases for the input of the new options. Created a test for the new public function `replayReadStateFromSnapshot`. Created integration tests for the new options against four stateful operators: limit, aggregation, deduplication, stream-stream join. Instead of generating states within the tests, which is unstable, I prepared golden files for the integration tests.

Was this patch authored or co-authored using generative AI tooling?
No.