
Conversation

@jerryshao
Contributor

What changes were proposed in this pull request?

We're building a data lineage tool in which we need to monitor the metadata changes in ExternalCatalog. The current ExternalCatalog already provides several useful events, such as "CreateDatabaseEvent", for custom SparkListeners to consume, but some events are still missing, like alter database and alter table events. So here I propose to add the missing ExternalCatalogEvents.
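
For context, a minimal sketch of the kind of custom listener this use case implies, assuming the existing event classes CreateDatabaseEvent and CreateTableEvent in org.apache.spark.sql.catalyst.catalog; the handler bodies are placeholders, not part of this PR:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.catalyst.catalog.{CreateDatabaseEvent, CreateTableEvent}

// Catalog events extend SparkListenerEvent, so they arrive through onOtherEvent.
class LineageListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: CreateDatabaseEvent => println(s"database created: ${e.database}")
    case e: CreateTableEvent    => println(s"table created: ${e.database}.${e.name}")
    case _                      => // ignore events we do not track
  }
}
```

Such a listener could then be registered through the spark.extraListeners configuration or SparkContext.addSparkListener.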

How was this patch tested?

Enriched the existing unit tests and tested on a local cluster.

CC @hvanhovell, please let me know your comments on the current proposal, thanks.

@SparkQA

SparkQA commented Nov 3, 2017

Test build #83375 has finished for PR 19649 at commit a3867b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Event fired before table schema is altered.
*/
case class AlterTableSchemaPreEvent(database: String, name: String) extends TableEvent
Contributor

Do we need a separate event for alter schema?

@hvanhovell
Contributor

Looks good, one small question.

*/
def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit
final def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit = {
postToAll(AlterTableSchemaPreEvent(db, table))
Contributor

AlterTableDataSchemaXXX
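
For reference, a minimal sketch (with locally defined stand-in types, not the exact upstream code) of the pre/post pattern this rename applies to, using the AlterTableDataSchema* names that the later test build reports:

```scala
import org.apache.spark.sql.types.StructType

// Local stand-ins for the PR's event hierarchy.
trait TableEvent { def database: String; def name: String }
case class AlterTableDataSchemaPreEvent(database: String, name: String) extends TableEvent
case class AlterTableDataSchemaEvent(database: String, name: String) extends TableEvent

// Stand-in for ExternalCatalog; postToAll comes from Spark's ListenerBus in the real code.
abstract class EventEmittingCatalog {
  protected def postToAll(event: TableEvent): Unit

  // Subclasses (e.g. HiveExternalCatalog) perform the actual metadata change.
  protected def doAlterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit

  // Fire a pre-event, apply the change, then fire a post-event.
  final def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit = {
    postToAll(AlterTableDataSchemaPreEvent(db, table))
    doAlterTableDataSchema(db, table, newDataSchema)
    postToAll(AlterTableDataSchemaEvent(db, table))
  }
}
```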

Contributor

shall we carry the new schema?

Contributor Author

I don't think it is necessary to carry the new schema in the event; we can query the catalog by db and table to get the newly set schema.
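
A hedged sketch of what this reply suggests: a consumer that looks the new schema up from the catalog when it receives the post-event, instead of reading it from the event itself. It assumes access to the session's ExternalCatalog instance:

```scala
import org.apache.spark.sql.catalyst.catalog.ExternalCatalog
import org.apache.spark.sql.types.StructType

// getTable returns the current CatalogTable, whose schema reflects the latest alteration.
def schemaAfterAlter(catalog: ExternalCatalog, db: String, table: String): StructType =
  catalog.getTable(db, table).schema
```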

* this becomes a no-op.
*/
def alterTable(tableDefinition: CatalogTable): Unit
final def alterTable(tableDefinition: CatalogTable): Unit = {
Contributor

I'd suggest we leave it for now and watch the other alterTableXXX events instead. I feel it's overkill to have one heavy alterTable method handling all table metadata updates; I think we will add more and more fine-grained alter table methods in the future, like alterTableStats, alterTableDataSchema, etc., and eventually this alterTable method will go away.

Contributor

@wzhfy commented Nov 4, 2017

This depends on which level of detail we want to collect in the event. Are there any guidelines or documentation on which events Spark should monitor?
Besides, partition changes are missing; I think it's necessary to monitor those changes too.

Contributor Author

@cloud-fan, since we currently expose the alterTable interface for other components to leverage, if we don't track it then it looks like we're missing a piece of the ExternalCatalogEvents. I think for now we can add this AlterTableEvent; later on, if we remove the method, we can make this event a no-op (kept only for compatibility). What do you think?

@wzhfy, I was thinking of adding partition-related events, but I'm not entirely sure why this whole piece is missing, or whether it is necessary to add them. If we reach an agreement on such events, I'm OK with adding them.

tableType = CatalogTableType.MANAGED,
storage = storage,
schema = new StructType().add("id", "long"))
val tableDefWithSparkVersion =
Contributor

Where do we need this val?

Contributor Author

Sorry, this was from my original code; I will update it.


override def alterTableDataSchema(
db: String, table: String, newDataSchema: StructType): Unit = withClient {
override def doAlterTableDataSchema(
Contributor

Please add a description of this method.
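
A possible shape for that description (the wording here is illustrative, not taken from the PR):

```scala
import org.apache.spark.sql.types.StructType

trait DataSchemaAlteringCatalog {
  /**
   * Alters the data schema (i.e. the non-partition columns) of the table identified by
   * `db` and `table`. Implementations persist `newDataSchema` in the underlying metastore;
   * partition columns and other table metadata are left unchanged.
   */
  protected def doAlterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit
}
```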


import org.apache.spark.scheduler.SparkListenerEvent
import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
import org.apache.spark.sql.types.StructType
Member

I don't see StructType and TablePartitionSpec being used below.

@SparkQA

SparkQA commented Nov 6, 2017

Test build #83476 has finished for PR 19649 at commit 8eef970.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AlterTableDataSchemaPreEvent(database: String, name: String) extends TableEvent
  • case class AlterTableDataSchemaEvent(database: String, name: String) extends TableEvent

@jiangxb1987
Contributor

lgtm

@viirya
Member

viirya commented Nov 6, 2017

LGTM

* @param newDataSchema Updated data schema to be used for the table.
*/
def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit
final def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit = {
Contributor

how about alterTableStats?

/**
* Event fired after a table is altered.
*/
case class AlterTableEvent(database: String, name: String) extends TableEvent
Contributor

Maybe we can have one event for all alterTableXXX, and add one more parameter to indicate which part of the table is altered.
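
A sketch of that idea, i.e. one pre/post event pair plus a kind field indicating which part of the table was altered; the exact fields in the merged change may differ, since the test-build summary below truncates the signatures:

```scala
// Local stand-in for the PR's base trait.
trait TableEvent { def database: String; def name: String }

case class AlterTablePreEvent(database: String, name: String, kind: String) extends TableEvent
case class AlterTableEvent(database: String, name: String, kind: String) extends TableEvent

// Example: an alterTableStats call could post AlterTableEvent(db, table, "stats").
```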

* type will generally be Table.
*/
object AlterTableKind extends Enumeration {
val Table, DataSchema, Stats = Value
Contributor

shall we just use string?

Contributor Author

I'm OK with using String, but I'd prefer a strong type to avoid nasty issues.

Contributor

String is better for backward compatibility, but it is easier to make mistakes with. I don't have a strong preference, cc @hvanhovell @gatorsmile
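
One string-based alternative along the lines discussed here (names are illustrative): plain constants keep the event JSON-friendly and backward compatible, at the cost of the compile-time checking an Enumeration would give.

```scala
object AlterTableKind {
  val TABLE = "table"
  val DATASCHEMA = "dataSchema"
  val STATS = "stats"
}
```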

@SparkQA

SparkQA commented Nov 7, 2017

Test build #83538 has finished for PR 19649 at commit c79c314.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AlterTablePreEvent(
  • case class AlterTableEvent(

@jerryshao
Contributor Author

One question, as also mentioned above: do we need to track partition-related events? @cloud-fan @hvanhovell @gatorsmile

@cloud-fan
Contributor

not sure, but maybe do it in a new PR?


/**
* Event fired before a database is altered.
*/
Member

Hi, @jerryshao.
We are adding AlterTableEvent and AlterDatabaseEvent. Can we have a more specific PR title instead of "Add more ExternalCatalogEvent"?

Contributor Author

Sure, I will update the title.

@jerryshao jerryshao changed the title [SPARK-22405][SQL] Add more ExternalCatalogEvent [SPARK-22405][SQL] Add new alter table and alter database related ExternalCatalogEvent Nov 9, 2017
@SparkQA

SparkQA commented Nov 9, 2017

Test build #83617 has finished for PR 19649 at commit 6b4fcff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

asfgit pushed a commit that referenced this pull request Dec 13, 2017
…file

## What changes were proposed in this pull request?

As Spark code changes, new events appear in the event log: #19649
And we used to maintain a whitelist to avoid exceptions: #15663
Currently the Spark history server stops parsing on unknown events or unrecognized properties, so we may only see part of the UI data.
For better compatibility, we can ignore unknown events and keep parsing through the log file.

## How was this patch tested?
Unit test

Author: Wang Gengliang <[email protected]>

Closes #19953 from gengliangwang/ReplayListenerBus.
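
A hedged sketch of the skip-on-unknown-event idea from the referenced commit message, not the actual ReplayListenerBus code; parseEvent stands in for Spark's JSON-to-event deserialization:

```scala
import scala.io.Source
import org.apache.spark.scheduler.SparkListenerEvent

// Replay an event log line by line, skipping lines that cannot be deserialized
// instead of aborting the whole replay.
def replayIgnoringUnknown(path: String, parseEvent: String => SparkListenerEvent)
                         (handle: SparkListenerEvent => Unit): Unit = {
  val source = Source.fromFile(path)
  try {
    source.getLines().foreach { line =>
      try {
        handle(parseEvent(line))
      } catch {
        // Unknown event class or unrecognized property: skip this line and keep going.
        case _: ClassNotFoundException | _: IllegalArgumentException => ()
      }
    }
  } finally {
    source.close()
  }
}
```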