Conversation

@gengliangwang
Member

@gengliangwang gengliangwang commented Aug 16, 2018

What changes were proposed in this pull request?

Create documentation for AVRO data source.
The new page will be linked in https://spark.apache.org/docs/latest/sql-programming-guide.html

For preview please unzip the following file:
AvroDoc.zip

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94850 has finished for PR 22121 at commit 3d8220f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94859 has finished for PR 22121 at commit 030ca0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

@cloud-fan @gatorsmile

@gatorsmile
Member

@gengliangwang Could you also post the screen shot in your PR description?

@gengliangwang gengliangwang changed the title [SPARK-25133][SQL][Doc]AVRO data source guide [SPARK-25133][SQL][Doc]Avro data source guide Aug 17, 2018
@SparkQA

SparkQA commented Aug 17, 2018

Test build #94886 has finished for PR 22121 at commit 72c8ef2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

should we also mention you can include with --jars if you build the jar?

Member Author

Here I am following https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying .
Using --packages ensures that this library and its dependencies will be added to the classpath, which should be good enough for general users.
For users who build their own jar, they are expected to know about the general --jars option.
I can add it if you insist.

Contributor

ok. When I see a deploying section I would expect it to tell me what my options are, so perhaps just rephrase to indicate that --packages is one way to do it.

It would be nice to at least have a general statement saying the external modules aren't included with Spark by default; the user must include the necessary jars themselves. The way to do this is deployment specific. One way of doing it is via the --packages option.

Note I think the structured-streaming-kafka section should ideally be updated to something similar as well. And really any external module for that matter. It would be nice to tell users how they can include these without assuming they just know how to.
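To make the two deployment routes the thread discusses concrete, a submit command could look roughly like this (the artifact coordinates and jar/app names below are assumptions for illustration, not taken from this PR):

```shell
# Option 1: let spark-submit resolve the external Avro module and its
# dependencies from Maven and add them to the classpath (coordinates are
# illustrative; check the version matching your Spark release):
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 my-app.jar

# Option 2: if you built or downloaded the jar yourself, ship it directly:
spark-submit --jars spark-avro_2.11-2.4.0.jar my-app.jar
```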

Member Author

Actually, the --jars option is well explained in https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management , and that doc URL is also mentioned in both Deploying sections.
I still feel it is unnecessary to have a short introduction to the --jars option here.

Contributor

I think this should be higher up not in the examples section. Perhaps in its own compatibility section.

Member Author
@gengliangwang gengliangwang Aug 17, 2018

I see. I can change the title to "read/write Avro data"... Thanks

Contributor

the configuration here has no spark. prefix? Is this set via the .option interface?
I think we should clarify that for the user, vs. later in the table you have the spark. configs, which I assume aren't set via .option but via --conf.

Member

Nit: I think it's just called "Avro", not "AVRO", and we should call it "Apache Avro" here.

Member

Call it "Apache Avro" in the title and first mention in the paragraph below. Afterwards, just "Avro" is OK.

Member

You can use back-ticks rather than <code> for simpler code formatting. No big deal either way.

Member

Space after headings like this

Member

support -> built-in support

@gatorsmile
Member

We should do the same thing for the other native sources.

@gatorsmile
Member

We also need to document the extra enhancements that are added in this release, compared with the databricks/spark-avro package.

@gengliangwang
Member Author

@tgravescs @srowen @gatorsmile Thanks for the reviews. I will keep updating this.

Member

Could you remove the repetition, line 191 ~ 195?

Member Author

It is a mistake. Thanks for pointing out!

Member

is not -> are not applied?

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94905 has finished for PR 22121 at commit ff6d3ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 18, 2018

Test build #94928 has finished for PR 22121 at commit 8b191bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 19, 2018

@gatorsmile does this address your comment about documenting new features?

@gengliangwang gengliangwang changed the title [SPARK-25133][SQL][Doc]Avro data source guide [WIP][SPARK-25133][SQL][Doc]Avro data source guide Aug 19, 2018
@gengliangwang
Member Author

@srowen Hi Sean, I will add content for new features soon. I also updated the title.
Thanks.

Member

@gengliangwang, I could check it by myself but thought it's easier to ask you. Do we now have all the options and configurations that exist in spark-avro?

Member Author

For data source options, yes.
For SQL configurations, I think the only one that matters is the one in #22133. I am thinking of a better name for that configuration.

Member Author

I will add a section for the SQL configurations.

Contributor
@tgravescs tgravescs Aug 21, 2018

Note: I think we should add a compatibility section here for compatibility with the Databricks Avro version; reference #22133

Member Author
@gengliangwang gengliangwang Aug 22, 2018

@tgravescs I have added an independent section for it :)

@gengliangwang gengliangwang changed the title [WIP][SPARK-25133][SQL][Doc]Avro data source guide [SPARK-25133][SQL][Doc]Avro data source guide Aug 22, 2018
@gengliangwang
Member Author

@srowen @tgravescs @gatorsmile @HyukjinKwon @dongjoon-hyun Thanks for the reviews! I have added the sections to_avro() and from_avro() and Compatibility with Databricks spark-avro.

I also attached an HTML file for preview; please check it in the PR description.

@SparkQA

SparkQA commented Aug 22, 2018

Test build #95099 has finished for PR 22121 at commit d2681ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 22, 2018

Test build #95100 has finished for PR 22121 at commit d9c5352.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// 2. Filter by column `favorite_color`;
// 3. Encode the column `name` in Avro format.
DataFrame output = df
.select(from_avro(col("value"), jsonFormatSchema).as("user"))

## Load and Save Functions

Since `spark-avro` module is external, there is not such API as `.avro` in
Contributor

there is no '.avro' API in

@SparkQA

SparkQA commented Aug 22, 2018

Test build #95105 has finished for PR 22121 at commit 8da8250.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

</div>

## to_avro() and from_avro()
Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
Contributor

not "Spark SQL", it should be "The Avro package"

Contributor

Regarding "encode a struct as a string": I think it's not "string" but "binary"?

Contributor

does it need to be a struct or any spark sql type?
maybe: to_avro to encode spark sql types as avro bytes and from_avro to retrieve avro bytes as spark sql types?
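To make the "binary, not string" point above concrete: Avro's wire format for integral values is zig-zag mapping followed by variable-length base-128 bytes, per the Avro specification. A minimal self-contained Python sketch (an illustration of why the encoded result is bytes, not Spark's actual implementation):

```python
def zigzag(n: int) -> int:
    # Avro maps signed integers to unsigned via zig-zag:
    # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    # Variable-length base-128 encoding of the zig-zagged value,
    # least-significant 7 bits first, high bit set on continuation bytes.
    z = zigzag(n)
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)
```

So a long like 64 comes out as the two raw bytes `0x80 0x01`, which is why the discussion settles on "binary" rather than "string".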

* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.

Both methods are presently only available in Scala and Java.
Contributor

Do not use "presently"; we should say "As of Spark 2.4, ..."

Member Author

I think it should be OK. In the SQL programming guide, there is a lot of "currently". Otherwise we would have to update "2.4" for each release. (Is there any way to get the release version in the doc?)
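On getting the release version into the doc: Spark's documentation is built with Jekyll, which exposes site variables that can be interpolated instead of hard-coding a version (the exact variable name here is an assumption worth verifying against docs/_config.yml):

```liquid
As of Spark {{site.SPARK_VERSION_SHORT}}, ...
```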

// 1. Decode the Avro data into a struct;
// 2. Filter by column `favorite_color`;
// 3. Encode the column `name` in Avro format.
DataFrame output = df
Contributor

Are you sure this compiles in Java?

Member

Looks OK except missing a semicolon at the end of the statements.

<tr>
<td><code>avroSchema</code></td>
<td>None</td>
<td>Optional Avro schema provided by an user in JSON format.</td>
Contributor

We should mention the behavior when the specified schema doesn't match the real schema.

<td></td>
</tr>
<tr>
<td>Date</td>
Contributor

DateType

// 1. Decode the Avro data into a struct;
// 2. Filter by column `favorite_color`;
// 3. Encode the column `name` in Avro format.
DataFrame output = df
Member

Looks OK except missing a semicolon at the end of the statements.

import org.apache.spark.sql.avro.*

// `from_avro` requires Avro schema in JSON string format.
String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
Member

Nit: usually calling the String(byte[]) constructor is a bad idea as it interprets the bytes according to whatever the platform default encoding is. Add StandardCharsets.UTF_8 as a second arg, but I don't know if this is too picky to care about in the example.

Member Author

I think it should be OK to ignore StandardCharsets.UTF_8.
The example code can be simple and just for demonstration.
The key part here is to_avro and from_avro.
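For reference, the "Avro schema in JSON string format" that from_avro expects is plain JSON, so any JSON parser can inspect it. A small self-contained sketch (the record fields are an assumption modeled on the name/favorite_color columns discussed in this thread, not the actual user.avsc):

```python
import json

# An Avro record schema in JSON form (fields are illustrative).
schema_bytes = b"""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
"""

# Decoding explicitly as UTF-8 avoids depending on the platform default
# charset, which is the same point raised about new String(byte[]) in Java.
json_format_schema = schema_bytes.decode("utf-8")
schema = json.loads(json_format_schema)
field_names = [f["name"] for f in schema["fields"]]
```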

</div>
<div data-lang="java" markdown="1">
{% highlight java %}
import org.apache.spark.sql.avro.*
Member

Semicolon at end of line (all statements in java)

Submission Guide for more details.

## Supported types for Avro -> Spark SQL conversion
Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and [complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of Avro.
Contributor

Hey. I know that we didn't support reading primitive types in the databricks-avro package, so I just tried to read a primitive avro file and I wasn't able to do so using the current master.

How I tried reading it => spark.read.format("avro").load("avroPrimitiveTypes/randomBoolean.avro")

I think we could reword and be explicit that we support reading primitive types under records unless I am missing something here.
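For orientation, the Avro primitive types under discussion map onto Spark SQL types roughly as follows. This is a sketch assembled from the Avro specification and the guide's conversion table, and should be verified against the final doc:

```python
# Avro primitive type -> Spark SQL data type (sketch of the guide's table).
AVRO_TO_SPARK_SQL = {
    "boolean": "BooleanType",
    "int":     "IntegerType",
    "long":    "LongType",
    "float":   "FloatType",
    "double":  "DoubleType",
    "bytes":   "BinaryType",
    "string":  "StringType",
}

# Per the comment above, a bare top-level primitive may not load directly;
# reading is exercised through record schemas, where each field's primitive
# type converts via this mapping.
def spark_type_for(avro_type: str) -> str:
    return AVRO_TO_SPARK_SQL[avro_type]
```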

@SparkQA

SparkQA commented Aug 22, 2018

Test build #95113 has finished for PR 22121 at commit 006ea40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 22, 2018

Test build #95116 has finished for PR 22121 at commit 581b7e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

The preview doc (zip file in the PR description) is updated to the latest version.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95139 has finished for PR 22121 at commit 8245806.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95140 has finished for PR 22121 at commit 1f253bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 05974f9 Aug 23, 2018