Conversation

@HyukjinKwon (Member) commented Feb 8, 2017

What changes were proposed in this pull request?

This PR proposes to add an API that loads a DataFrame from a Dataset[String] storing CSV.

It allows pre-processing before parsing CSV into a DataFrame, which enables workarounds for many narrow cases, for example as below (an end-to-end sketch follows these cases):

  • Case 1 - pre-processing

    val df = spark.read.text("...")
    // Pre-processing with this.
    spark.read.csv(df.as[String])
  • Case 2 - use other input formats

    val rdd = spark.sparkContext.newAPIHadoopFile("/file.csv.lzo",
      classOf[com.hadoop.mapreduce.LzoTextInputFormat],
      classOf[org.apache.hadoop.io.LongWritable],
      classOf[org.apache.hadoop.io.Text])
    val stringRdd = rdd.map(pair => new String(pair._2.getBytes, 0, pair._2.getLength))
    
    spark.read.csv(stringRdd.toDS)
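
For concreteness, a minimal end-to-end sketch of how the proposed API would be used (the SparkSession setup and the sample data are illustrative, not from this PR):

import org.apache.spark.sql.SparkSession

object CsvFromDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-from-dataset")
      .getOrCreate()
    import spark.implicits._

    // CSV records held in a Dataset[String], e.g. produced by pre-processing.
    val csvLines = Seq("name,age", "Alice,30", "Bob,25").toDS()

    // The API proposed in this PR: parse a Dataset[String] as CSV.
    val df = spark.read.option("header", "true").csv(csvLines)
    df.show()

    spark.stop()
  }
}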

How was this patch tested?

Added tests in CSVSuite and built with Scala 2.10:

./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package

@HyukjinKwon changed the title [SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] [WIP][SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] Feb 8, 2017
@HyukjinKwon (Member, Author) commented Feb 8, 2017

Let me try to address the comments further and double-check tomorrow.

@SparkQA commented Feb 8, 2017

Test build #72588 has finished for PR 16854 at commit eabb3f3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class UnivocityParser(

@HyukjinKwon changed the title [WIP][SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] [WIP][SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV Feb 8, 2017
@HyukjinKwon (Member, Author)

Just to help review, there is a similar code path in:

val lines = {
  val conf = broadcastedHadoopConf.value.value
  val linesReader = new HadoopFileLinesReader(file, conf)
  Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => linesReader.close()))
  linesReader.map { line =>
    new String(line.getBytes, 0, line.getLength, csvOptions.charset)
  }
}
val linesWithoutHeader = if (csvOptions.headerFlag && file.start == 0) {
  // Note that if there are only comments in the first block, the header would
  // probably not be dropped.
  CSVUtils.dropHeaderLine(lines, csvOptions)
} else {
  lines
}
val filteredLines = CSVUtils.filterCommentAndEmpty(linesWithoutHeader, csvOptions)
val parser = new UnivocityParser(dataSchema, requiredSchema, csvOptions)
filteredLines.flatMap(parser.parse)

@HyukjinKwon changed the title [WIP][SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV [SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV Feb 9, 2017
@SparkQA commented Feb 9, 2017

Test build #72623 has finished for PR 16854 at commit a7e8c2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class UnivocityParser(

@HyukjinKwon (Member, Author) commented Feb 9, 2017

cc @cloud-fan, do you think it is worth adding this API?

Contributor

Shall we also add def json(lines: Dataset[String]), and deprecate json(r: RDD[String]) and json(r: JavaRDD[String])?

cc @rxin
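
For reference, a sketch of how the suggested JSON analogue would read (hedged; the Dataset-based JSON reader was tracked separately in SPARK-15615, discussed below). It assumes an existing SparkSession named spark:

import spark.implicits._

// JSON records held in a Dataset[String], mirroring this PR's csv(Dataset[String]).
val jsonLines = Seq("""{"a": 1}""", """{"a": 2}""").toDS()
val df = spark.read.json(jsonLines)
df.show()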

@HyukjinKwon (Member, Author) commented Feb 11, 2017

Sure. Actually, there is a JIRA and a closed PR, SPARK-15615 and #13460, where I was negative because it could easily be worked around.

However, I am fine with it if we are promoting Datasets over RDDs for advantages like SPARK-18362 (if applicable).

cc @pjfanning, could you reopen and proceed with your PR if we are all fine with that?

Member

@HyukjinKwon I can look at resurrecting the pull request for SPARK-15615.

Member

@HyukjinKwon I added a new pull request, #16895, because my original branch was deleted.

@HyukjinKwon (Member, Author)

Let me update this after #16976 gets merged, as that substantially changes the related code path.

@HyukjinKwon changed the title [SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV [WIP][SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV Mar 2, 2017
@HyukjinKwon (Member, Author)

cc @cloud-fan, this is still WIP, but I am trying to consolidate the different execution paths in CSV parsing here.

For example,

  • spark.read.csv(file)

    • data: parseIterator (note that this one reads from a partitioned file).
    • schema: tokenizeDataset
  • spark.read.csv(file) with wholeFile

    • data: parseStream
    • schema: tokenizeStream
  • spark.read.csv(dataset)

    • data: parseDataset
    • schema: tokenizeDataset

However, it seems to end up with slightly awkward arguments here. Do you think it is okay? (A rough sketch of the dispatch follows below.)
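
To illustrate, a rough sketch of the dispatch described above (the path names come from the list; the types and signatures are simplified and hypothetical, not Spark's actual internals):

// How the input kind selects a (data, schema) pair of parsing paths.
sealed trait CsvInput
case class FileInput(path: String, wholeFile: Boolean) extends CsvInput
case class DatasetInput(lines: Seq[String]) extends CsvInput

def choosePaths(input: CsvInput): (String, String) = input match {
  case FileInput(_, false) => ("parseIterator", "tokenizeDataset")
  case FileInput(_, true)  => ("parseStream", "tokenizeStream")
  case DatasetInput(_)     => ("parseDataset", "tokenizeDataset")
}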

@HyukjinKwon (Member, Author) commented Mar 2, 2017

If you are not sure, or it looks painful to review, let me take out all the changes and put them into DataFrameReader.csv for now. Otherwise, I will take a further look and see if I can generalise these more rather than just putting them together.

@SparkQA commented Mar 2, 2017

Test build #73756 has finished for PR 16854 at commit de492b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CSVDataSource extends Serializable

@cloud-fan (Contributor)

Does def csv(csvDataset: Dataset[String]) need to support wholeFile? I think the JSON one doesn't support it either.

@HyukjinKwon (Member, Author)

Oh, no, it does not need to.

I just meant to de-duplicate some logic per #16854 (comment). Let me remove that part and leave only the code changes dedicated to this JIRA; it seems to be confusing reviewers. Let me clean it up soon.

@HyukjinKwon (Member, Author) commented Mar 4, 2017

The reason this one exists, unlike JSON, is that CSV always needs an initial read (even if it does not infer the schema, it needs at least the number of values). In this case, we can return an empty result fast. (A sketch follows below.)
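
A hedged sketch of that short-circuit over plain Scala collections (the real code operates on a Dataset[String]; the names here are illustrative):

// If filtering comments and empty lines leaves nothing, return an empty result
// immediately instead of attempting the initial read.
val filteredLines: Seq[String] = Seq.empty // e.g. the input held only comments
val maybeFirstLine: Option[String] = filteredLines.take(1).headOption

maybeFirstLine match {
  case Some(firstLine) =>
    println(s"Proceed: infer the number of values/schema from: $firstLine")
  case None =>
    println("Short-circuit: return an empty DataFrame with the known schema.")
}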

@HyukjinKwon (Member, Author)

There is a similar code path in:

override def infer(
    sparkSession: SparkSession,
    inputPaths: Seq[FileStatus],
    parsedOptions: CSVOptions): Option[StructType] = {
  val csv: Dataset[String] = createBaseDataset(sparkSession, inputPaths, parsedOptions)
  val firstLine: String = CSVUtils.filterCommentAndEmpty(csv, parsedOptions).first()
  val firstRow = new CsvParser(parsedOptions.asParserSettings).parseLine(firstLine)
  val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
  val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
  val tokenRDD = csv.rdd.mapPartitions { iter =>
    val filteredLines = CSVUtils.filterCommentAndEmpty(iter, parsedOptions)
    val linesWithoutHeader =
      CSVUtils.filterHeaderLine(filteredLines, firstLine, parsedOptions)
    val parser = new CsvParser(parsedOptions.asParserSettings)
    linesWithoutHeader.map(parser.parseLine)
  }
  Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
}

Contributor

Then can we just call TextInputCSVDataSource.infer here?

@HyukjinKwon (Member, Author)

Oh, sorry, I overlooked that. It seems TextInputCSVDataSource.infer takes input paths, whereas we want a Dataset here. Let me take a look and see if we could reuse it.

@HyukjinKwon (Member, Author)

Maybe this is too much. I am willing to revert it.

@HyukjinKwon (Member, Author)

Let me double-check tomorrow before getting rid of [WIP].

@SparkQA commented Mar 4, 2017

Test build #73902 has finished for PR 16854 at commit 859113a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CSVOptions(
  • class UnivocityParser(

@SparkQA commented Mar 4, 2017

Test build #73903 has finished for PR 16854 at commit b806698.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author)

makeSafeHeader was moved from the CSVDataSource class to the CSVDataSource companion object so that it can be accessed from DataFrameReader.

@HyukjinKwon changed the title [WIP][SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV [SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV Mar 5, 2017
@SparkQA commented Mar 5, 2017

Test build #73927 has finished for PR 16854 at commit aed003e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 5, 2017

Test build #73928 has finished for PR 16854 at commit de08313.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class CSVDataSource extends Serializable

@HyukjinKwon changed the title [SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV [WIP][SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV Mar 6, 2017
@HyukjinKwon changed the title [WIP][SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV [SPARK-15463][SQL] Add an API to load DataFrame from Dataset[String] storing CSV Mar 6, 2017
@HyukjinKwon (Member, Author)

@cloud-fan, I think this is ready for another look.

@SparkQA commented Mar 6, 2017

Test build #74017 has finished for PR 16854 at commit af9bc6f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 6, 2017

Test build #74011 has finished for PR 16854 at commit 3a0401a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Infers the schema from `Dataset` that stores CSV string records.
 */
def inferFromDataset(
@HyukjinKwon (Member, Author)

There is almost no code modification here; it was just moved from above.

@SparkQA commented Mar 7, 2017

Test build #74092 has finished for PR 16854 at commit 92dfdf9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 7, 2017

Test build #74098 has finished for PR 16854 at commit a2739fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 7, 2017

Test build #74101 has finished for PR 16854 at commit a14df70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 7, 2017

Test build #74102 has finished for PR 16854 at commit a0a79dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


test("Empty dataframe produces empty dataframe") {
// Empty dataframe with schema.
val emptyDF = spark.createDataFrame(
Contributor

Why do we create emptyDF? It looks like we only need a schema here.

@HyukjinKwon (Member, Author)

Sure, let me fix it up.
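
A hedged sketch of the agreed simplification: build the schema directly rather than creating an empty DataFrame just to take its schema (the field names are hypothetical; CSVSuite's actual test may differ). It assumes a SparkSession named spark, as in the test suites:

import org.apache.spark.sql.types._
import spark.implicits._

// Hypothetical schema for the test.
val schema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType)))

// Read an empty Dataset[String] with this schema and check the result is empty.
val emptyDF = spark.read.schema(schema).csv(spark.emptyDataset[String])
assert(emptyDF.schema == schema)
assert(emptyDF.collect().isEmpty)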

@SparkQA commented Mar 8, 2017

Test build #74200 has finished for PR 16854 at commit 3f42c4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Thanks, merging to master!

@asfgit closed this in 4551290 Mar 8, 2017
val filteredLines: Dataset[String] =
  CSVUtils.filterCommentAndEmpty(csvDataset, parsedOptions)
val maybeFirstLine: Option[String] = filteredLines.take(1).headOption

val schema = userSpecifiedSchema.getOrElse {
@gatorsmile (Member)

We should issue an error when users try to parse it as wholeFile.

We also need to check whether all the other CSV options are still accepted by this API. (A sketch of such a check follows below.)
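
A hedged sketch of the kind of fail-fast validation being requested (validateDatasetOptions is hypothetical, not Spark's actual API):

// Reject options that cannot apply to a Dataset[String] input instead of
// silently ignoring them.
def validateDatasetOptions(options: Map[String, String]): Unit = {
  if (options.get("wholeFile").exists(_.toBoolean)) {
    throw new IllegalArgumentException(
      "wholeFile is not supported when reading CSV from a Dataset[String]")
  }
}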

@HyukjinKwon (Member, Author)

@gatorsmile, yes, we need to check, for the JSON API too. Though, should we throw an error? It reminds me of from_json/to_json, which ignore parse modes.

@gatorsmile (Member)

We should not simply ignore the options without error messages. The options are not like hints.

@HyukjinKwon deleted the SPARK-15463 branch January 2, 2018 03:44