Conversation

@gengliangwang
Member

@gengliangwang gengliangwang commented Aug 16, 2018

What changes were proposed in this pull request?

Create documentation for AVRO data source.
The new page will be linked in https://spark.apache.org/docs/latest/sql-programming-guide.html

For preview please unzip the following file:
AvroDoc.zip

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94850 has finished for PR 22121 at commit 3d8220f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94859 has finished for PR 22121 at commit 030ca0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

@cloud-fan @gatorsmile

@gatorsmile
Member

@gengliangwang Could you also post the screen shot in your PR description?

@gengliangwang gengliangwang changed the title [SPARK-25133][SQL][Doc]AVRO data source guide [SPARK-25133][SQL][Doc]Avro data source guide Aug 17, 2018
@SparkQA

SparkQA commented Aug 17, 2018

Test build #94886 has finished for PR 22121 at commit 72c8ef2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

should we also mention you can include with --jars if you build the jar?

Member Author

Here I am following https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying .
Using --packages ensures that this library and its dependencies will be added to the classpath, which should be good enough for general users.
For users who build their own jar, they are expected to know about the general --jars option.
I can add it if you insist.

Contributor

ok. When I see a deploying section I would expect it to tell me what my options are, so perhaps just rephrase to indicate that --packages is one way to do it.

It would be nice to at least have a general statement saying the external modules aren't included with Spark by default; the user must include the necessary jars themselves. The way to do this is deployment specific. One way of doing it is via the --packages option.

Note I think the structured-streaming-kafka section should ideally be updated to something similar as well. And really any external module for that matter. It would be nice to tell users how they can include these without assuming they just know how to.
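To make the two deployment routes the thread discusses concrete, a submit command could look roughly like this (the artifact coordinates and jar/app names below are assumptions for illustration, not taken from this PR):

```shell
# Option 1: let spark-submit resolve the external Avro module and its
# dependencies from Maven and add them to the classpath (coordinates are
# illustrative; check the version matching your Spark release):
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 my-app.jar

# Option 2: if you built or downloaded the jar yourself, ship it directly:
spark-submit --jars spark-avro_2.11-2.4.0.jar my-app.jar
```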

Member Author

Actually, the --jars option is well explained in https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management , and that doc URL is also mentioned in both Deploying sections.
I still feel it is unnecessary to have a short introduction to the --jars option here.

Contributor

I think this should be higher up not in the examples section. Perhaps in its own compatibility section.

Member Author
@gengliangwang gengliangwang Aug 17, 2018

I see. I can change the title to "read/write Avro data"... Thanks

Contributor

the configuration here has no spark. prefix? Is this set via the .option interface?
I think we should clarify that for the user, vs. later in the table you have the spark. configs, which I assume aren't set via .option but via --conf.

Member

Nit: I think it's just called "Avro", not "AVRO", and we should call it "Apache Avro" here.

Member

Call it "Apache Avro" in the title and first mention in the paragraph below. Afterwards, just "Avro" is OK.

Member

You can use back-ticks rather than <code> for simpler code formatting. No big deal either way.

Member

Space after headings like this

Member

support -> built-in support

@gatorsmile
Member

We should do the same thing for the other native sources.

@gatorsmile
Member

We also need to document the extra enhancements that are added in this release, compared with the databricks/spark-avro package.

@gengliangwang
Member Author

@tgravescs @srowen @gatorsmile Thanks for the reviews. I will keep updating this.

Member

Could you remove the repetition, line 191 ~ 195?

Member Author

It is a mistake. Thanks for pointing out!

Member

is not -> are not applied?

@SparkQA

SparkQA commented Aug 17, 2018

Test build #94905 has finished for PR 22121 at commit ff6d3ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 18, 2018

Test build #94928 has finished for PR 22121 at commit 8b191bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 19, 2018

@gatorsmile does this address your comment about documenting new features?

@gengliangwang gengliangwang changed the title [SPARK-25133][SQL][Doc]Avro data source guide [WIP][SPARK-25133][SQL][Doc]Avro data source guide Aug 19, 2018
@gengliangwang
Member Author

@srowen Hi Sean, I will add content for new features soon. I also updated the title.
Thanks.

Member

@gengliangwang, I could check it by myself but thought it's easier to ask you. Do we now have all the options and configurations that exist in spark-avro?

Member Author

For data source options, yes.
For SQL configurations, I think the only one that matters is the one in #22133. I am thinking of a better name for that configuration.

Member Author

I will add a section for the SQL configurations.

Contributor
@tgravescs tgravescs Aug 21, 2018

Note: I think we should add a compatibility section here for compatibility with the Databricks Avro version; reference #22133

Member Author
@gengliangwang gengliangwang Aug 22, 2018

@tgravescs I have added an independent section for it :)

@gengliangwang gengliangwang changed the title [WIP][SPARK-25133][SQL][Doc]Avro data source guide [SPARK-25133][SQL][Doc]Avro data source guide Aug 22, 2018
@gengliangwang
Member Author

@srowen @tgravescs @gatorsmile @HyukjinKwon @dongjoon-hyun Thanks for the reviews! I have added the sections to_avro() and from_avro() and Compatibility with Databricks spark-avro.

I also attached an HTML file for preview; please check it in the PR description.

@SparkQA

SparkQA commented Aug 22, 2018

Test build #95099 has finished for PR 22121 at commit d2681ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 22, 2018

Test build #95100 has finished for PR 22121 at commit d9c5352.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// 2. Filter by column `favorite_color`;
// 3. Encode the column `name` in Avro format.
DataFrame output = df
.select(from_avro(col("value"), jsonFormatSchema).as("user"))

## Load and Save Functions

Since `spark-avro` module is external, there is not such API as `.avro` in
Contributor

there is no '.avro' API in

@SparkQA

SparkQA commented Aug 22, 2018

Test build #95105 has finished for PR 22121 at commit 8da8250.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

</div>

## to_avro() and from_avro()
Spark SQL provides function `to_avro` to encode a struct as a string and `from_avro()` to retrieve the struct as a complex type.
Contributor

not "Spark SQL", it should be "The Avro package"

Contributor

Regarding "encode a struct as a string": I think it's not "string" but "binary"?

Contributor

does it need to be a struct or any spark sql type?
maybe: to_avro to encode spark sql types as avro bytes and from_avro to retrieve avro bytes as spark sql types?
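To make the "binary, not string" point above concrete: Avro's wire format for integral values is zig-zag mapping followed by variable-length base-128 bytes, per the Avro specification. A minimal self-contained Python sketch (an illustration of why the encoded result is bytes, not Spark's actual implementation):

```python
def zigzag(n: int) -> int:
    # Avro maps signed integers to unsigned via zig-zag:
    # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    # Variable-length base-128 encoding of the zig-zagged value,
    # least-significant 7 bits first, high bit set on continuation bytes.
    z = zigzag(n)
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)
```

So a long like 64 comes out as the two raw bytes `0x80 0x01`, which is why the discussion settles on "binary" rather than "string".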

* If the "value" field that contains your data is in Avro, you could use `from_avro()` to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.
* `to_avro()` can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka.

Both methods are presently only available in Scala and Java.
Contributor

Do not use "presently"; we should say "As of Spark 2.4, ..."

Member Author

I think it should be OK. In the SQL programming guide, there is a lot of "currently". Otherwise we would have to update "2.4" for each release. (Is there any way to get the release version in the doc?)
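On getting the release version into the doc: Spark's documentation is built with Jekyll, which exposes site variables that can be interpolated instead of hard-coding a version (the exact variable name here is an assumption worth verifying against docs/_config.yml):

```liquid
As of Spark {{site.SPARK_VERSION_SHORT}}, ...
```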

// 1. Decode the Avro data into a struct;
// 2. Filter by column `favorite_color`;
// 3. Encode the column `name` in Avro format.
DataFrame output = df
Contributor

Are you sure this compiles in Java?

Member

Looks OK except missing a semicolon at the end of the statements.

<tr>
<td><code>avroSchema</code></td>
<td>None</td>
<td>Optional Avro schema provided by an user in JSON format.</td>
Contributor

We should mention the behavior when the specified schema doesn't match the real schema.

<td></td>
</tr>
<tr>
<td>Date</td>
Contributor

DateType

// 1. Decode the Avro data into a struct;
// 2. Filter by column `favorite_color`;
// 3. Encode the column `name` in Avro format.
DataFrame output = df
Member

Looks OK except missing a semicolon at the end of the statements.

import org.apache.spark.sql.avro.*

// `from_avro` requires Avro schema in JSON string format.
String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
Member

Nit: usually calling the String(byte[]) constructor is a bad idea as it interprets the bytes according to whatever the platform default encoding is. Add StandardCharsets.UTF_8 as a second arg, but I don't know if this is too picky to care about in the example.

Member Author

I think it should be OK to ignore StandardCharsets.UTF_8.
The example code can be simple and just for demonstration.
The key part here is to_avro and from_avro.
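For reference, the "Avro schema in JSON string format" that from_avro expects is plain JSON, so any JSON parser can inspect it. A small self-contained sketch (the record fields are an assumption modeled on the name/favorite_color columns discussed in this thread, not the actual user.avsc):

```python
import json

# An Avro record schema in JSON form (fields are illustrative).
schema_bytes = b"""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
"""

# Decoding explicitly as UTF-8 avoids depending on the platform default
# charset, which is the same point raised about new String(byte[]) in Java.
json_format_schema = schema_bytes.decode("utf-8")
schema = json.loads(json_format_schema)
field_names = [f["name"] for f in schema["fields"]]
```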

</div>
<div data-lang="java" markdown="1">
{% highlight java %}
import org.apache.spark.sql.avro.*
Member

Semicolon at end of line (all statements in java)

Submission Guide for more details.

## Supported types for Avro -> Spark SQL conversion
Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and [complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of Avro.
Contributor

Hey. I know that we didn't support reading primitive types in the databricks-avro package, so I just tried to read a primitive avro file and I wasn't able to do so using the current master.

How I tried reading it => spark.read.format("avro").load("avroPrimitiveTypes/randomBoolean.avro")

I think we could reword and be explicit that we support reading primitive types under records unless I am missing something here.
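For orientation, the Avro primitive types under discussion map onto Spark SQL types roughly as follows. This is a sketch assembled from the Avro specification and the guide's conversion table, and should be verified against the final doc:

```python
# Avro primitive type -> Spark SQL data type (sketch of the guide's table).
AVRO_TO_SPARK_SQL = {
    "boolean": "BooleanType",
    "int":     "IntegerType",
    "long":    "LongType",
    "float":   "FloatType",
    "double":  "DoubleType",
    "bytes":   "BinaryType",
    "string":  "StringType",
}

# Per the comment above, a bare top-level primitive may not load directly;
# reading is exercised through record schemas, where each field's primitive
# type converts via this mapping.
def spark_type_for(avro_type: str) -> str:
    return AVRO_TO_SPARK_SQL[avro_type]
```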

@SparkQA

SparkQA commented Aug 22, 2018

Test build #95113 has finished for PR 22121 at commit 006ea40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 22, 2018

Test build #95116 has finished for PR 22121 at commit 581b7e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

The preview doc (zip file in the PR description) is updated to the latest version.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95139 has finished for PR 22121 at commit 8245806.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2018

Test build #95140 has finished for PR 22121 at commit 1f253bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 05974f9 Aug 23, 2018