[SPARK-19800][SS][WIP] Implement one kind of streaming sampling - reservoir sampling #17141

Closed

uncleGen wants to merge 6 commits into apache:master from uncleGen:sampling

Contributor

uncleGen commented Mar 2, 2017

What changes were proposed in this pull request?

This PR adds a dedicated streaming sample operator to support sampling on streams. It adds a new evolving operator, reservoir, and introduces a new logical plan, ReservoirSample, and two physical plans, StreamingReservoirSampleExec and ReservoirSampleExec.

The following cases are supported:

batch table reservoir sampling
stream table reservoir sampling with/without aggregation and watermark in Update/Complete output mode

Not supported cases:

reservoir sampling in Append output mode (not meaningful)

Followups:

move reservoir into the sample operator

How was this patch tested?

Added new unit tests.

uncleGen added 3 commits March 2, 2017 21:52

3c7dc19 Implement one kind of streaming sampling, i.e. reservoir sampling
23738cf bug fix
288c124 update

Member

srowen commented Mar 2, 2017

Why does this need to be in Spark?

SparkQA commented Mar 2, 2017

Test build #73778 has finished for PR 17141 at commit 288c124.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Contributor Author

uncleGen commented Mar 3, 2017

@srowen Some operators are still unsupported in Structured Streaming. You can see them here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala#L184
This PR adds support for the sample operator, or more precisely a reservoir operator. As a next step, I will try to fold reservoir into sample.

Contributor Author

uncleGen commented Mar 3, 2017

cc @zsxwing and @tdas

uncleGen commented

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

    
                private val enc = Encoders.STRING.asInstanceOf[ExpressionEncoder[String]]

                private val NUM_RECORDS_IN_PARTITION = enc.toRow("NUM_RECORDS_IN_PARTITION")

                  .asInstanceOf[UnsafeRow]

Contributor Author

uncleGen Mar 3, 2017

NUM_RECORDS_IN_PARTITION tracks the total number of records in the current partition and is updated at the end of sampling.

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

    
                            UnsafeProjection.create(withSumFieldTypes).apply(InternalRow.fromSeq(

                              new JoinedRow(kv._2, numRecordsTillNow)

                                .toSeq(withSumFieldTypes)))

                          }), {})

Contributor Author

uncleGen Mar 3, 2017

Here we transform each row into (row, numRecordsTillNow); numRecordsTillNow is used to calculate the item's weight.

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

    
                            .map(update => {

                              UnsafeProjection.create(withSumFieldTypes).apply(InternalRow.fromSeq(

                                  new JoinedRow(update.value, numRecordsTillNow)

                                    .toSeq(withSumFieldTypes)))

Contributor Author

uncleGen Mar 3, 2017

Same as above.

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

    
                          .apply(InternalRow.fromSeq(row.toSeq(fieldTypes)))

                      ).iterator

                  })

                }

Contributor Author

uncleGen Mar 3, 2017

Here we perform a single global weighted reservoir sampling pass.
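To make the idea of a global weighted pass concrete, here is a minimal sketch in plain Scala, independent of Spark's internal row classes. It uses the Efraimidis-Spirakis A-Res scheme (each item gets the key u^(1/weight) and the k largest keys are kept); whether the patch uses this exact scheme is not shown in the diff fragment. The names `WeightedReservoir`, `sample`, and `mergePartitions` are hypothetical, and weighting each retained row by (records seen in its partition) / (rows retained from that partition) is an assumption of this sketch.

```scala
import scala.collection.mutable
import scala.util.Random

// Hypothetical sketch, not taken from the patch: weighted reservoir sampling
// using Efraimidis-Spirakis A-Res keys.
object WeightedReservoir {

  /** Keeps at most k items; items with larger weights are more likely to be kept. */
  def sample[T](items: Iterator[(T, Double)], k: Int, rng: Random): Vector[T] = {
    require(k > 0, "reservoir size must be positive")
    // The ordering is inverted so that heap.head is the entry with the SMALLEST key.
    val heap = mutable.PriorityQueue.empty[(Double, T)](
      Ordering.fromLessThan[(Double, T)]((a, b) => a._1 > b._1))
    items.foreach { case (item, weight) =>
      require(weight > 0.0, "weights must be positive")
      val key = math.pow(rng.nextDouble(), 1.0 / weight)
      if (heap.size < k) {
        heap.enqueue((key, item))
      } else if (key > heap.head._1) {
        heap.dequeue() // evict the current smallest key
        heap.enqueue((key, item))
      }
    }
    heap.iterator.map(_._2).toVector
  }

  /**
   * Merges per-partition samples given as (retained rows, records seen).
   * Assumption: each retained row is weighted by seen / retained.
   */
  def mergePartitions[T](parts: Seq[(Seq[T], Long)], k: Int, rng: Random): Vector[T] = {
    val weighted = parts.iterator.flatMap { case (rows, seen) =>
      rows.iterator.map(row => (row, seen.toDouble / rows.size))
    }
    sample(weighted, k, rng)
  }
}
```

With this weighting, rows retained from a partition that saw many records are more likely to survive the merge than rows from a partition that saw only a few.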

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

    
                          store.put(replacementIdx, r.asInstanceOf[UnsafeRow])

                        }

                      }

                    }

Contributor Author

uncleGen Mar 3, 2017

Within a partition, we only need a single normal (unweighted) reservoir sampling pass.
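For readers unfamiliar with the per-partition step, the following is a minimal sketch of classic unweighted reservoir sampling (Algorithm R) over a plain iterator. It is not the patch's code: the object `PartitionReservoir`, its `sample` method, and the in-memory buffer standing in for the state store are all hypothetical. It also returns the number of records seen, mirroring the per-partition count described earlier in this review.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical helper, not from the patch: Algorithm R over one partition.
object PartitionReservoir {

  /** Returns (uniform sample of at most k rows, number of rows seen). */
  def sample[T](rows: Iterator[T], k: Int, rng: Random): (Vector[T], Long) = {
    require(k > 0, "reservoir size must be positive")
    val reservoir = ArrayBuffer.empty[T] // stands in for the state store
    var seen = 0L
    rows.foreach { row =>
      seen += 1
      if (reservoir.size < k) {
        // Fill phase: the first k rows are always kept.
        reservoir += row
      } else {
        // Draw a slot in [0, seen); the row replaces an existing entry only
        // when the slot falls inside the reservoir (probability k / seen).
        val replacementIdx = (rng.nextDouble() * seen).toLong
        if (replacementIdx < k) reservoir(replacementIdx.toInt) = row
      }
    }
    (reservoir.toVector, seen)
  }
}
```

A call such as `PartitionReservoir.sample((1 to 100).iterator, 10, new Random(42))` yields 10 distinct values from the range together with the count 100.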

sql/core/src/test/scala/org/apache/spark/sql/streaming/ReservoirSampleSuit.scala

    
                    .groupBy("value").count()

                  assert(df.count() == 3, "")

                }

              }

Contributor Author

uncleGen Mar 3, 2017

The new unit test needs to be improved.

Contributor Author

uncleGen commented Mar 7, 2017

ping @tdas and @zsxwing


c4008cd Merge branch 'master' into sampling

SparkQA commented Mar 9, 2017

Test build #74238 has finished for PR 17141 at commit c4008cd.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public static class LongWrapper
public static class IntWrapper
case class ResolveInlineTables(conf: CatalystConf) extends Rule[LogicalPlan]
case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] with PredicateHelper
case class JoinPlan(itemIds: Set[Int], plan: LogicalPlan, joinConds: Set[Expression], cost: Cost)
case class Cost(rows: BigInt, size: BigInt)
abstract class RepartitionOperation extends UnaryNode
case class FlatMapGroupsWithState(
class CSVOptions(
class UnivocityParser(
trait WatermarkSupport extends UnaryExecNode
case class FlatMapGroupsWithStateExec(


1ddb82e Merge branch 'master' into sampling

SparkQA commented Mar 20, 2017

Test build #74841 has finished for PR 17141 at commit 1ddb82e.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.


02d44aa bug fix

SparkQA commented Mar 20, 2017

Test build #74843 has finished for PR 17141 at commit 02d44aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ReservoirSampleExec(reservoirSize: Int, child: SparkPlan) extends UnaryExecNode

Contributor

jiangxb1987 commented May 24, 2017

Is anyone still working on this, or can it be closed? @uncleGen @zsxwing

jiangxb1987 mentioned this pull request

[INFRA] Close stale PRs #18223

Closed

asfgit closed this in

b771fed
