[SPARK-25684][SQL] Organize header related codes in CSV datasource #22676
Conversation
cc @cloud-fan since you're looking at JSON-related code and reviewed some of the related PRs, and @MaxGekk since you're looking into this area.
```diff
     tokenizer: CsvParser): Iterator[Array[String]] = {
-    convertStream(inputStream, shouldDropHeader, tokenizer)(tokens => tokens)
+    val handleHeader: () => Unit =
+      () => if (shouldDropHeader) tokenizer.parseNext
```
This is used in the schema inference path, where we don't check the header; here it only drops the header.
```diff
-    convertStream(inputStream, shouldDropHeader, tokenizer, checkHeader) { tokens =>
+    val handleHeader: () => Unit =
+      () => headerChecker.checkHeaderColumnNames(tokenizer)
```
This matches the code structure of `parseStream` and `parseIterator`, which are used in multiLine and non-multiLine modes.
```scala
  /**
   * Generates a header from the given row which is null-safe and duplicate-safe.
   */
  def makeSafeHeader(
```
It's moved as-is.
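For readers skimming the thread, a rough self-contained sketch of what "null-safe and duplicate-safe" means here (simplified; the real `CSVUtils.makeSafeHeader` also honors case sensitivity and the CSV options):

```scala
object SafeHeaderSketch {
  // Null entries get positional fallback names; duplicated names get an
  // index suffix so downstream schema fields stay unique.
  def makeSafeHeader(row: Array[String]): Array[String] = {
    val duplicates = row.filter(_ != null)
      .groupBy(identity).collect { case (k, vs) if vs.length > 1 => k }.toSet
    row.zipWithIndex.map {
      case (null, i)                     => s"_c$i"
      case (name, i) if duplicates(name) => s"$name$i"
      case (name, _)                     => name
    }
  }

  def main(args: Array[String]): Unit = {
    println(makeSafeHeader(Array("id", null, "id", "name")).mkString(", "))
    // id0, _c1, id2, name
  }
}
```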
```scala
   *
   * @param columnNames names of CSV columns that must be checked against the schema.
   */
  private def checkHeaderColumnNames(columnNames: Array[String]): Unit = {
```
It's moved as-is, except for the parameters in its signature.
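A self-contained sketch of the core comparison this method performs (simplified; the real checker also handles length mismatches and builds the detailed error message quoted below):

```scala
object HeaderCheckSketch {
  // Returns the index of the first header/schema mismatch, if any.
  // Assumes both arrays have the same length.
  def firstMismatch(
      columnNames: Array[String],
      fieldNames: Array[String],
      caseSensitive: Boolean): Option[Int] =
    columnNames.indices.find { i =>
      val (h, f) = (columnNames(i), fieldNames(i))
      if (caseSensitive) h != f else !h.equalsIgnoreCase(f)
    }

  def main(args: Array[String]): Unit = {
    println(firstMismatch(Array("id", "Name"), Array("id", "name"), true))  // Some(1)
    println(firstMismatch(Array("id", "Name"), Array("id", "name"), false)) // None
  }
}
```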
```scala
          | Header: ${columnNames.mkString(", ")}
          | Schema: ${fieldNames.mkString(", ")}
          |Expected: ${fieldNames(i)} but found: ${columnNames(i)}
          |$source""".stripMargin)
```
This is the only diff in this method. Previously it was:

```scala
          |CSV file: $fileName""".stripMargin)
```

The generalized `$source` ends up carrying the class of the source here; see https://github.com/apache/spark/pull/22676/files#diff-f70bda59304588cc3abfa3a9840653f4R512.
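Putting the pieces together, a self-contained sketch of how that message is assembled (the first message line and the sample values are assumptions, not copied from Spark):

```scala
object HeaderErrorMessageSketch {
  // Rebuilds the message from the hunk quoted above; the prefix line
  // "CSV header does not conform to the schema." is assumed.
  def message(columnNames: Array[String], fieldNames: Array[String],
              i: Int, source: String): String =
    s"""|CSV header does not conform to the schema.
        | Header: ${columnNames.mkString(", ")}
        | Schema: ${fieldNames.mkString(", ")}
        |Expected: ${fieldNames(i)} but found: ${columnNames(i)}
        |$source""".stripMargin

  def main(args: Array[String]): Unit =
    println(message(Array("id", "Name"), Array("id", "name"), 1,
      "CSV file: /tmp/people.csv"))
}
```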
Test build #97149 has finished for PR 22676 at commit
Test build #97147 has finished for PR 22676 at commit
```scala
 * if unknown or not applicable (for instance when the input is a dataset),
 * can be omitted.
 */
class CSVHeaderChecker(
```
Can this be private to the csv or spark packages? Or is this now part of a public API?
It's under the execution package, which is meant to be private. Otherwise, since it's accessed from DataFrameReader, it would need to be `private[sql]`, and such modifiers were removed in SPARK-16964 for exactly this reason.
Definitely it looks better.
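As a side note for readers unfamiliar with Scala's package-qualified access modifiers, a minimal illustration (hypothetical package and class names):

```scala
package org.example.sql.execution.datasources.csv

// Visible anywhere under org.example.sql (e.g. a DataFrameReader analogue),
// but not to code outside that package subtree.
private[sql] class HeaderChecker

// Visible only within the csv package itself.
private[csv] class CsvOnlyHelper
```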
```diff
-      parsedOptions.enforceSchema,
-      sparkSession.sessionState.conf.caseSensitiveAnalysis)
+      parsedOptions,
+      source = s"CSV source: ${csvDataset.getClass.getCanonicalName}")
```
Would it be better to output more concrete info about the dataset? For example, `toString` outputs field names at least. I think it would help in log analysis.
Makes sense. If it's just `toString`, of course I can fix it here since the change is small, although it's orthogonal.
```scala
      } else {
        filteredLines.rdd
      }
    }.getOrElse(filteredLines.rdd)
```
It is not directly related to your changes. Just in case, why do we convert the `Dataset` to an `RDD` here?
I don't exactly remember. Looks like we can change it to `Dataset`.
```scala
 * if unknown or not applicable (for instance when the input is a dataset),
 * can be omitted.
 */
class CSVHeaderChecker(
```
Is the `CSV` prefix of `CSVHeaderChecker` necessary? The class is in the `csv` package already, so it should be clear that it checks CSV headers.
Let's leave it as-is. It's the existing naming convention within each datasource.
```scala
        parsedOptions)
      val schema = if (columnPruning) requiredSchema else dataSchema
      val headerChecker = new CSVHeaderChecker(
        schema, parsedOptions, source = s"CSV file: ${file.filePath}", file.start == 0)
```
`isStartOfFile = file.start == 0`
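In other words, the suggestion is to name the positional boolean at the call site. A small self-contained illustration (hypothetical simplified signature, not the real CSVHeaderChecker):

```scala
object NamedArgSketch {
  final case class CSVHeaderCheckerLike(
      source: String, isStartOfFile: Boolean = false)

  def main(args: Array[String]): Unit = {
    val start = 0L
    // A bare `start == 0` at the call site reads as an opaque boolean;
    // naming the argument documents the intent inline.
    val checker = CSVHeaderCheckerLike(
      source = "CSV file: /tmp/people.csv",
      isStartOfFile = start == 0)
    println(checker)
  }
}
```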
```scala
    }

    // We can handle header here since here the stream is open.
    handleHeader()
```
It looks slightly strange that we consume data from the input before the upper layer starts reading it.
It is, but I guess it was already done this way.
```scala
      parser: UnivocityParser,
      headerChecker: CSVHeaderChecker,
      schema: StructType): Iterator[InternalRow] = {
    headerChecker.checkHeaderColumnNames(lines, parser.tokenizer)
```
The same question here. I would prefer to consume the input iterator lazily. That is one of the advantages of iterators: they perform an action only when you explicitly call `hasNext` or `next`, compared to collections, for example.
Ditto. It was already done this way. Let's keep the original path as-is, since this PR targets organizing the code.
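For completeness, the lazy alternative discussed here could look roughly like the following plain-Scala sketch; this is not what the PR does, since it deliberately keeps the existing eager behavior:

```scala
object LazyHeaderCheckSketch {
  // Wraps an iterator so `check` runs on the first hasNext/next call
  // instead of at construction time.
  def withLazyCheck[T](iter: Iterator[T])(check: () => Unit): Iterator[T] =
    new Iterator[T] {
      private var checked = false
      private def ensure(): Unit = if (!checked) { checked = true; check() }
      override def hasNext: Boolean = { ensure(); iter.hasNext }
      override def next(): T = { ensure(); iter.next() }
    }

  def main(args: Array[String]): Unit = {
    val lines = Iterator("a,b", "1,2")
    val out = withLazyCheck(lines)(() => println("header checked"))
    println("iterator constructed")   // prints before "header checked"
    out.foreach(println)
  }
}
```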
```diff
       case Some(firstRow) if firstRow != null =>
         val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
-        val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
+        val header = CSVUtils.makeSafeHeader(firstRow, caseSensitive, parsedOptions)
```
What about importing it from `CSVUtils`? What is the reason for the prefix here?
Because the code here mostly uses the `CSVUtils.`-prefixed form; I just followed that.
Test build #97162 has finished for PR 22676 at commit
retest this please
Test build #97235 has finished for PR 22676 at commit
retest this please
Test build #97240 has finished for PR 22676 at commit
Merged to master.
Thank you @cloud-fan and @MaxGekk for reviewing this.
Closes apache#22676 from HyukjinKwon/refactoring-csv.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request?

1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is).
   - Historically, and at the first refactoring (which I did), I intended to put all CSV-specific handling (like options), filtering, header extraction, etc. there.
   - See `JsonDataSource`: now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we had better match them as much as we can.
2. Create `CSVHeaderChecker` and put the `enforceSchema` logic into it (a hedged sketch of the resulting API follows at the end of this description).
   - Header checking and column pruning were added (per [SPARK-23786][SQL] Checking column names of csv headers #20894 and [SPARK-24244][SQL] Passing only required columns to the CSV parser #21296), but some of the code, such as [SPARK-25134][SQL] Csv column pruning with checking of headers throws incorrect error #22123, was duplicated.
   - Also, the header-checking code was scattered here and there; we had better put it in a single place, as the scattering was quite error-prone. See [SPARK-25669][SQL] Check CSV header only when it exists #22656.
3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is).
   - Similar reasons as 1 above.

## How was this patch tested?

Existing tests should cover this.
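For reference, a hedged, plain-Scala approximation of the consolidated checker this change converges on; the signature and semantics are inferred from the hunks quoted in the thread and from the documented `enforceSchema` option, not copied from the source:

```scala
// Plain-Scala approximation of the consolidated checker; real Spark types
// (StructType, CSVOptions, CsvParser) are replaced with stand-ins so the
// sketch stays self-contained.
object CSVHeaderCheckerSketch {
  class CSVHeaderChecker(
      schemaFieldNames: Array[String],
      enforceSchema: Boolean,
      source: String,
      isStartOfFile: Boolean = false) {
    // In Spark, enforceSchema=true means "trust the given schema and skip
    // the name check"; names are validated only when enforcement is off,
    // and only for the partition that actually contains the header.
    def checkHeaderColumnNames(header: Array[String]): Unit =
      if (!enforceSchema && isStartOfFile &&
          !header.sameElements(schemaFieldNames))
        throw new IllegalArgumentException(
          s"CSV header does not conform to the schema.\n$source")
  }

  def main(args: Array[String]): Unit = {
    val checker = new CSVHeaderChecker(
      Array("id", "name"), enforceSchema = false,
      source = "CSV file: /tmp/people.csv", isStartOfFile = true)
    checker.checkHeaderColumnNames(Array("id", "name")) // passes silently
    println("header ok")
  }
}
```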