[SPARK-16496][SQL] Add wholetext as option for reading text in SQL. #14151

ScrapCodes · 2016-07-12T10:45:45Z

What changes were proposed in this pull request?

In many text analysis problems, it is undesirable for the rows to be split on "\n". The RDD API already has a wholeText reader, and this JIRA adds the same support to the Dataset API.
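To illustrate the row semantics described above, here is a minimal pure-Python sketch. It is not the Spark implementation: `read_text_rows` is a hypothetical stand-in that only mimics the difference between the default line-per-row behaviour and the proposed whole-file mode, and `str.splitlines` is a simplification of how a real line reader splits records. The file name and its contents mirror the PySpark doctest shown later in this thread, where the corresponding call is `spark.read.text(path, wholetext=True)`.

```python
import os
import tempfile


def read_text_rows(path, wholetext=False):
    """Hypothetical stand-in for a text reader: returns one dict per row."""
    # newline="" disables newline translation so the content is kept verbatim.
    with open(path, encoding="utf-8", newline="") as f:
        content = f.read()
    if wholetext:
        # Whole-file mode: the entire file becomes a single row.
        return [{"value": content}]
    # Default mode: one row per line, with the line terminator dropped.
    return [{"value": line} for line in content.splitlines()]


def demo():
    """Write a two-line file and read it back in both modes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "text-test.txt")
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write("hello\nthis")
        return read_text_rows(path), read_text_rows(path, wholetext=True)


line_rows, whole_rows = demo()
print(line_rows)   # [{'value': 'hello'}, {'value': 'this'}]
print(whole_rows)  # [{'value': 'hello\nthis'}]
```

In the default mode the two lines become two rows; in whole-file mode the newline stays inside a single value.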

How was this patch tested?

Added new tests for both the Scala and Java APIs.

SparkQA · 2016-07-12T12:32:59Z

Test build #62158 has finished for PR 14151 at commit bd2936d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text]
- class WholeTextFileFormat extends TextFileFormat

SparkQA · 2016-07-12T13:19:47Z

Test build #62161 has finished for PR 14151 at commit dafe981.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-07-12T17:16:29Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala

put this in the proper package?

This is currently in the same package as HadoopFileLinesReader, i.e. datasources. Should I move both of them to the datasources.text package?

It looks like they might be used by several other formats too, so it is unclear to me what you mean by the proper package.

rxin · 2016-07-12T17:17:03Z

BTW instead of a whole new format, I think this should just be an option in the existing text format.

ScrapCodes · 2016-07-13T05:01:45Z

Actually, what you said sounds like a nice idea. I was wondering whether it is possible to propagate this as an option to other formats such as CSV and JSON too.

rxin · 2016-07-13T05:59:45Z

For now let's just do it for text file. I took a look - I guess it is ok to leave them in datasources for now.

SparkQA · 2016-07-13T11:47:45Z

Test build #62233 has finished for PR 14151 at commit 6e83f46.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text]

ScrapCodes · 2016-07-13T12:47:19Z

I have a question: should we keep a column with the file names? The current approach ignores the key column.

SparkQA · 2016-07-14T11:48:42Z

Test build #62308 has finished for PR 14151 at commit 82952e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ScrapCodes · 2016-07-15T05:23:09Z

@rxin Do you think it looks okay now?

ScrapCodes · 2016-08-09T04:44:09Z

@rxin Ping!

frreiss · 2016-08-15T18:01:39Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Should this really be a session-global configuration? It seems like something that is specific to a particular input file and should only be set when opening a given file.

Yeah, I don't think it should be a session-wide config.

They are removed.

SparkQA · 2016-08-16T11:09:31Z

Test build #63839 has finished for PR 14151 at commit 2540018.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ScrapCodes · 2016-08-26T09:36:27Z

@rxin Ping!

gatorsmile · 2016-09-01T06:10:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala

Like what we did for csv and json, could you document this new option in DataFrameReader?

Actually, we might need to document this within readwriter.py too.

Good reminder, @HyukjinKwon!

Move it to TextOptions?

ScrapCodes · 2016-09-01T09:15:01Z

Thanks @gatorsmile. I was actually wondering where I could document this option.

SparkQA · 2016-09-01T11:05:21Z

Test build #64773 has finished for PR 14151 at commit 8ac37c1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ScrapCodes · 2016-09-08T05:56:01Z

Hey @rxin, do you have any further comments?

SparkQA · 2016-09-30T07:17:43Z

Test build #66161 has finished for PR 14151 at commit 74a5f28.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text]

SparkQA · 2016-09-30T09:43:26Z

Test build #66164 has finished for PR 14151 at commit 3f8a177.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)

SparkQA · 2016-10-05T07:44:56Z

Test build #66375 has finished for PR 14151 at commit e263b15.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)

sameeragarwal · 2017-06-16T23:34:13Z

test this please

SparkQA · 2017-06-16T23:42:55Z

Test build #78197 has finished for PR 14151 at commit e263b15.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)

ScrapCodes · 2017-12-01T09:09:05Z

@viirya Can you please take another look?

gatorsmile · 2017-12-05T23:06:15Z

retest this please

SparkQA · 2017-12-06T01:53:02Z

Test build #84512 has finished for PR 14151 at commit da64f2d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-12-06T03:38:59Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala

nit: // scalastyle:on nonascii

viirya · 2017-12-06T04:15:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala

We can avoid using var:

    val reader = if (!wholeTextMode) {
      new HadoopFileLinesReader(file, confValue)
    } else {
      new HadoopFileWholeTextReader(file, confValue)
    }

viirya · 2017-12-06T04:20:28Z

python/pyspark/sql/readwriter.py

Can you add a doctest for wholetext too?

SparkQA · 2017-12-07T13:16:14Z

Test build #84602 has finished for PR 14151 at commit dd2ed3d.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)

ScrapCodes · 2017-12-07T13:51:00Z

The Python pydoc style check is failing at [Row(value=u'hello\nthis')]. It does not like the literal '\n', and I could not find a way to fix it. Any help would be appreciated.

HyukjinKwon · 2017-12-07T13:59:45Z

python/pyspark/sql/readwriter.py

        [Row(value=u'hello'), Row(value=u'this')]
+        >>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True)
+        >>> df.collect()
+        [Row(value=u'hello\nthis')]


Hm, can't we just do \\n?

That would fail the test, I suppose. I can give that a try though.
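For illustration, the following minimal sketch shows why doubling the backslash can work in a doctest. It is not the code from this PR: `whole_text_value` and `count_failures` are hypothetical names, and the thread does not state exactly what the style checker objected to. The sketch relies only on standard Python behaviour: in a regular (non-raw) docstring, `\\n` is stored as the two characters backslash and n, which is what `repr()` prints for a string containing a newline.

```python
import doctest


def whole_text_value():
    """Return a value containing a newline, as a whole-file read would.

    The expected output below escapes the backslash (``\\n``) because this
    is a regular, non-raw docstring. The doctest therefore sees a backslash
    followed by ``n``, which matches what repr() prints.

    >>> whole_text_value()
    'hello\\nthis'
    """
    return "hello\nthis"


def count_failures(func):
    """Run the doctest examples in func's docstring; return the failure count."""
    parser = doctest.DocTestParser()
    # Arguments: docstring, globals for the examples, test name, filename, lineno.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test).failed


print(count_failures(whole_text_value))  # 0
```

An alternative with the same effect is a raw docstring (`r"""..."""`), in which a single `\n` is kept as two characters.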

SparkQA · 2017-12-08T08:05:01Z

Test build #84645 has finished for PR 14151 at commit 7e91020.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-12-08T08:10:10Z

retest this please.

SparkQA · 2017-12-08T09:49:52Z

Test build #84648 has finished for PR 14151 at commit 7e91020.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-12-08T10:10:51Z

retest this please

SparkQA · 2017-12-08T12:59:41Z

Test build #84652 has finished for PR 14151 at commit 7e91020.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-12-08T13:25:33Z

Looks like the escaping is OK.

gatorsmile · 2017-12-08T18:05:49Z

Since we expect users to use this one instead of the RDD's wholeText reader, could you add the new test cases from WholeTextFileRecordReaderSuite? Thanks!

SparkQA · 2017-12-11T11:53:04Z

Test build #84704 has finished for PR 14151 at commit 66d5b45.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class WholeTextFileSuite extends QueryTest with SharedSQLContext

SparkQA · 2017-12-14T08:05:01Z

Test build #84896 has finished for PR 14151 at commit 021039b.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

ScrapCodes · 2017-12-14T09:15:42Z

retest this please

SparkQA · 2017-12-14T12:03:13Z

Test build #84905 has finished for PR 14151 at commit 021039b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-12-14T19:19:14Z

LGTM

gatorsmile · 2017-12-14T19:20:32Z

Thanks! Merged to master.

The code style issues will be addressed by my other PRs.

rxin reviewed Jul 12, 2016

ScrapCodes changed the title ~~[SPARK-16496][SQL] Add wholetext as data source for SQL.~~ [SPARK-16496][SQL] Add wholetext as option for reading text in SQL. Jul 13, 2016

ScrapCodes force-pushed the SPARK-16496/wholetext branch from dafe981 to 6e83f46 Compare July 13, 2016 09:55

frreiss reviewed Aug 15, 2016

gatorsmile reviewed Sep 1, 2016

ScrapCodes force-pushed the SPARK-16496/wholetext branch from 8ac37c1 to 74a5f28 Compare September 30, 2016 07:09

ScrapCodes force-pushed the SPARK-16496/wholetext branch from 3f8a177 to e263b15 Compare October 5, 2016 05:38

HyukjinKwon mentioned this pull request Oct 17, 2016

[SPARK-17969]I think it's user unfriendly to process standard json file with DataFrame #15511

Closed

ScrapCodes force-pushed the SPARK-16496/wholetext branch from e263b15 to cab3323 Compare June 28, 2017 11:30

viirya reviewed Dec 6, 2017

[SPARK-16496][SQL] Add wholetext as option for reading text in SQL.

dd2ed3d

ScrapCodes force-pushed the SPARK-16496/wholetext branch from da64f2d to dd2ed3d Compare December 7, 2017 13:07

HyukjinKwon reviewed Dec 7, 2017

Try out escaping slash.

7e91020

Added a WholeTextFileSuite, covering more cases from the corresponding RDD version of the option.

66d5b45

fixed tests

021039b

ScrapCodes force-pushed the SPARK-16496/wholetext branch from 989ab94 to 021039b Compare December 14, 2017 06:35

asfgit closed this in 40de176 Dec 14, 2017
