-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-16496][SQL] Add wholetext as option for reading text in SQL. #14151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #62158 has finished for PR 14151 at commit
|
|
Test build #62161 has finished for PR 14151 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this in the proper package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently in the same package as HadoopFileLineReader ? i.e. datasources. Should I move both of them to the package datasource.text ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like they might get used in multiple other formats too, what do you intend by proper package is unclear to me.
|
BTW instead of a whole new format, I think this should just be an option in the existing text format. |
|
Actually what you said sounds like a nice idea, I was considering is it possible to propagate this as an option in all other formats like CSV and Json too ? |
|
For now let's just do it for text file. I took a look - I guess it is ok to leave them in datasources for now. |
dafe981 to
6e83f46
Compare
|
Test build #62233 has finished for PR 14151 at commit
|
|
I have a question, should we keep a column with filenames ? in current approach we ignore key column. |
|
Test build #62308 has finished for PR 14151 at commit
|
|
@rxin Do you think it looks okay now ? |
|
@rxin Ping ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this really be a session-global configuration? It seems like something that is specific to a particular input file and should only be set when opening a given file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I don't think it should be a session wide config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are removed.
|
Test build #63839 has finished for PR 14151 at commit
|
|
@rxin Ping ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like what we did for csv and json, could you document this new option in DataFrameReader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we might need to document this within readwriter.py too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good reminder ! @HyukjinKwon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move it to TextOptions?
|
Thanks @gatorsmile. I was actually wondering, where can I document this option. |
|
Test build #64773 has finished for PR 14151 at commit
|
|
Hey @rxin, do you have further comments ? |
8ac37c1 to
74a5f28
Compare
|
Test build #66161 has finished for PR 14151 at commit
|
|
Test build #66164 has finished for PR 14151 at commit
|
3f8a177 to
e263b15
Compare
|
Test build #66375 has finished for PR 14151 at commit
|
|
test this please |
|
Test build #78197 has finished for PR 14151 at commit
|
e263b15 to
cab3323
Compare
|
@viirya Can you please take another look? |
|
retest this please |
|
Test build #84512 has finished for PR 14151 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: // scalastyle:on nonascii
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can avoid using var:
val reader = if (!wholeTextMode) {
new HadoopFileLinesReader(file, confValue)
} else {
new HadoopFileWholeTextReader(file, confValue)
}
python/pyspark/sql/readwriter.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a doctest for wholetext too?
da64f2d to
dd2ed3d
Compare
|
Test build #84602 has finished for PR 14151 at commit
|
|
This python pydoc style is failing at |
python/pyspark/sql/readwriter.py
Outdated
| [Row(value=u'hello'), Row(value=u'this')] | ||
| >>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True) | ||
| >>> df.collect() | ||
| [Row(value=u'hello\nthis')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, can't we just do \\n?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would fail the test, I suppose. I can give that a try though.
|
Test build #84645 has finished for PR 14151 at commit
|
|
retest this please. |
|
Test build #84648 has finished for PR 14151 at commit
|
|
retest this please |
|
Test build #84652 has finished for PR 14151 at commit
|
|
Looks the escaping is ok. |
|
Since we expect users to use this one, instead of the RDD's wholeText reader. Could you add the new test cases from |
…g RDD version of the option.
|
Test build #84704 has finished for PR 14151 at commit
|
989ab94 to
021039b
Compare
|
Test build #84896 has finished for PR 14151 at commit
|
|
retest this please |
|
Test build #84905 has finished for PR 14151 at commit
|
|
LGTM |
|
Thanks! Merged to master. The code style issues will be addressed by my other PRs. |
What changes were proposed in this pull request?
In multiple text analysis problems, it is not often desirable for the rows to be split by "\n". There exists a wholeText reader for RDD API, and this JIRA just adds the same support for Dataset API.
How was this patch tested?
Added relevant new tests for both scala and Java APIs