[SPARK-22405][SQL] Add new alter table and alter database related ExternalCatalogEvent #19649
Conversation
5c628be to a3867b7
Test build #83375 has finished for PR 19649 at commit
/**
 * Event fired before table schema is altered.
 */
case class AlterTableSchemaPreEvent(database: String, name: String) extends TableEvent
Do we need a separate event for alter schema?
Looks good, one small question.
 */
def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit
final def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit = {
  postToAll(AlterTableSchemaPreEvent(db, table))
AlterTableDataSchemaXXX
shall we carry the new schema?
I don't think it's necessary to carry the new schema; we can query the catalog by db and table to get the newly set schema.
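To make the pattern concrete, here is a minimal, self-contained sketch of the pre/post event flow being discussed. The event and method names (AlterTableDataSchemaPreEvent/Event, CatalogWithEvents) are illustrative only, and StructType is replaced by a plain placeholder type so the snippet stands alone:

```scala
// Minimal sketch of the pre/post event pattern under discussion; names are
// illustrative and may differ from what is finally merged.
trait ExternalCatalogEvent
case class AlterTableDataSchemaPreEvent(database: String, name: String) extends ExternalCatalogEvent
case class AlterTableDataSchemaEvent(database: String, name: String) extends ExternalCatalogEvent

abstract class CatalogWithEvents {
  // Stand-in for the listener bus; the real implementation posts to registered listeners.
  protected def postToAll(event: ExternalCatalogEvent): Unit = println(event)

  // Subclasses (e.g. a Hive-backed catalog) implement the actual metadata change.
  protected def doAlterTableDataSchema(db: String, table: String, newDataSchema: Seq[String]): Unit

  // Public entry point: fire the pre-event, apply the change, fire the post-event.
  // Per the discussion, the events carry only db and table; listeners that need the
  // new schema can look it up in the catalog afterwards.
  final def alterTableDataSchema(db: String, table: String, newDataSchema: Seq[String]): Unit = {
    postToAll(AlterTableDataSchemaPreEvent(db, table))
    doAlterTableDataSchema(db, table, newDataSchema)
    postToAll(AlterTableDataSchemaEvent(db, table))
  }
}
```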
 * this becomes a no-op.
 */
def alterTable(tableDefinition: CatalogTable): Unit
final def alterTable(tableDefinition: CatalogTable): Unit = {
I'd suggest we leave it for now and watch other alterTableXXX events instead. I feel it's overkill to have a heavy alterTable method handling all table metadata updates; I think we will add more and more fine-grained alter table methods in the future, like alterTableStats, alterTableDataSchema, etc., and eventually this alterTable method will go away.
This depends on what level of detail we want to collect in the event. Are there any guidelines or documentation on which events Spark should monitor?
Besides, partition changes are missing; I think it's necessary to monitor those changes as well.
@cloud-fan, since we now expose the alterTable interface for other components to leverage, if we don't track it, it looks like we're missing a piece of the ExternalCatalogEvents. I think for now we can add this AlterTableEvent; later on, if we remove the method, we can make the event a no-op (kept only for compatibility). What do you think?
@wzhfy, I was thinking about adding partition related events, but I'm not sure why this whole piece is missing, or whether it's necessary to add partition related events. If we reach an agreement on such events, I'm OK with adding them.
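For reference only, a purely hypothetical sketch of what partition-related events might look like; nothing below is part of this PR:

```scala
// Purely hypothetical sketch of partition-related events; these are NOT added
// by this PR, they only illustrate the question raised above.
object PartitionEventsSketch {
  trait ExternalCatalogEvent
  trait TableEvent extends ExternalCatalogEvent {
    def database: String
    def name: String
  }

  // A partition is identified by its spec: a map of partition column -> value.
  type TablePartitionSpec = Map[String, String]

  case class AlterPartitionPreEvent(
      database: String, name: String, specs: Seq[TablePartitionSpec]) extends TableEvent
  case class AlterPartitionEvent(
      database: String, name: String, specs: Seq[TablePartitionSpec]) extends TableEvent
}
```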
  tableType = CatalogTableType.MANAGED,
  storage = storage,
  schema = new StructType().add("id", "long"))
val tableDefWithSparkVersion =
Where do we need this val?
Sorry this was from my original code, will update it.
override def alterTableDataSchema(
    db: String, table: String, newDataSchema: StructType): Unit = withClient {
override def doAlterTableDataSchema(
Please add a description of this method.
import org.apache.spark.scheduler.SparkListenerEvent
import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
import org.apache.spark.sql.types.StructType
I don't see StructType and TablePartitionSpec being used below.
Test build #83476 has finished for PR 19649 at commit
lgtm

LGTM
 * @param newDataSchema Updated data schema to be used for the table.
 */
def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit
final def alterTableDataSchema(db: String, table: String, newDataSchema: StructType): Unit = {
how about alterTableStats?
/**
 * Event fired after a table is altered.
 */
case class AlterTableEvent(database: String, name: String) extends TableEvent
Maybe we can have one event for all alterTableXXX, and add one more parameter to indicate which part of the table is altered.
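A rough sketch of that "single event plus kind" shape (field names assumed, not necessarily the final code):

```scala
// Sketch of one event pair covering all alterTableXXX operations, with a `kind`
// field indicating which part of the table changed; whether `kind` is an enum
// or a string is discussed further below.
trait ExternalCatalogEvent
trait TableEvent extends ExternalCatalogEvent {
  def database: String
  def name: String
}

case class AlterTablePreEvent(database: String, name: String, kind: String) extends TableEvent
case class AlterTableEvent(database: String, name: String, kind: String) extends TableEvent

// Usage example: a statistics update would post
//   AlterTablePreEvent("default", "t1", "stats") before the change and
//   AlterTableEvent("default", "t1", "stats") after it.
```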
 * type will generally be Table.
 */
object AlterTableKind extends Enumeration {
  val Table, DataSchema, Stats = Value
shall we just use string?
I'm OK with using String, but I'd prefer a strong type to avoid nasty issues.
String is better for backward compatibility, but it's easier to make mistakes with. I don't have a strong preference. cc @hvanhovell @gatorsmile
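To illustrate the trade-off, a side-by-side sketch of the two options (object and constant names are illustrative):

```scala
// Option 1: strongly typed. A mistyped kind fails at compile time, but adding or
// renaming values can hurt compatibility for listeners that replay old event logs.
object AlterTableKindEnum extends Enumeration {
  val Table, DataSchema, Stats = Value
}

// Option 2: plain string constants. Friendlier for backward compatibility in
// serialized event logs, but a typo like "dataschema" only surfaces at runtime.
object AlterTableKindStrings {
  val TABLE = "table"
  val DATASCHEMA = "dataSchema"
  val STATS = "stats"
}
```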
Test build #83538 has finished for PR 19649 at commit
One more question, as mentioned above: do we need to track partition related events? @cloud-fan @hvanhovell @gatorsmile
not sure, but maybe do it in a new PR?
/**
 * Event fired before a database is altered.
 */
Hi, @jerryshao.
We are adding AlterTableEvent and AlterDatabaseEvent. Can we have a more specific PR title instead of "Add more ExternalCatalogEvent"?
Sure, I will update the title.
Test build #83617 has finished for PR 19649 at commit
thanks, merging to master!
…file

What changes were proposed in this pull request?

While Spark code changes, there are new events in the event log: #19649. And we used to maintain a whitelist to avoid exceptions: #15663. Currently the Spark history server will stop parsing on unknown events or unrecognized properties, so we may still see only part of the UI data. For better compatibility, we can ignore unknown events and parse through the rest of the log file.

How was this patch tested?

Unit test

Author: Wang Gengliang <[email protected]>

Closes #19953 from gengliangwang/ReplayListenerBus.
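As a generic illustration of that idea only, and not the actual ReplayListenerBus implementation, a replay loop can skip lines it cannot handle instead of aborting:

```scala
import scala.io.Source
import scala.util.control.NonFatal

object LenientReplaySketch {
  // Replay a line-per-event log, skipping lines that cannot be handled instead of
  // aborting, so later events still get processed.
  def replay(path: String)(handle: String => Unit): Unit = {
    val source = Source.fromFile(path)
    try {
      source.getLines().zipWithIndex.foreach { case (line, i) =>
        try {
          handle(line) // e.g. deserialize the JSON line into an event and post it
        } catch {
          case NonFatal(e) =>
            // Unknown event or unrecognized property: log and keep going.
            println(s"skipping unparseable event at line ${i + 1}: ${e.getMessage}")
        }
      }
    } finally {
      source.close()
    }
  }
}
```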
What changes were proposed in this pull request?
We're building a data lineage tool in which we need to monitor metadata changes in ExternalCatalog. The current ExternalCatalog already provides several useful events, like CreateDatabaseEvent, for a custom SparkListener to consume, but some events are still missing, such as alter database and alter table events. So here we propose to add new ExternalCatalogEvents.
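For illustration, a minimal sketch of how such a lineage tool might consume these events through a custom SparkListener; the listener class is hypothetical, and the event class names, fields, and package are assumed to match what this PR adds:

```scala
// Sketch of a lineage listener; assumes the AlterTableEvent / AlterDatabaseEvent
// classes added by this PR (fields shown may differ from the final merged code).
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.catalyst.catalog.{AlterDatabaseEvent, AlterTableEvent, CreateDatabaseEvent}

class LineageCatalogListener extends SparkListener {
  // ExternalCatalogEvents extend SparkListenerEvent, so they arrive via onOtherEvent.
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: CreateDatabaseEvent => println(s"database created: ${e.database}")
    case e: AlterDatabaseEvent  => println(s"database altered: ${e.database}")
    case e: AlterTableEvent     => println(s"table altered: ${e.database}.${e.name}")
    case _ => // not a catalog event we care about
  }
}

// Register it, e.g. with --conf spark.extraListeners=com.example.LineageCatalogListener
// or programmatically: spark.sparkContext.addSparkListener(new LineageCatalogListener)
```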
How was this patch tested?
Enriched the current UTs and tested on a local cluster.
CC @hvanhovell, please let me know your comments on the current proposal, thanks.