HBASE-29003 Proper bulk load tracking #6506
Conversation
Force-pushed from 46a5851 to fbb7080.
Force-pushed from fbb7080 to 44307bf.
Do you mind fixing this broken checkstyle while you're in there?
Force-pushed from 15fdfe7 to 01b6b87.
Fixed the checkstyle error. @ndimiduk @rmdmattingly the discussion on Jira about the best approach has stalled. Should I wait for further feedback? Personally, I think the best solution would be to track in the backup system table whether a full backup is required or not. That tracking would have to happen through an Observer, as is done in this ticket. Given that this PR solves a real issue, I'd suggest including it and leaving the switch to the tracking system for a later commit. Note that users would need to update their config to include the new observer. I'm not sure whether this counts as a migration, since the backup feature is still considered experimental, but it should be mentioned somewhere; I'm just unsure where.
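Purely as an illustration of the migration note above, the update would amount to operators appending the new observer to their coprocessor configuration, roughly as in the sketch below; the coprocessor key and observer class name used here are assumptions, since the final class and registration point were still being discussed at this stage:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.coprocessor.CoprocessorHost;

public class ObserverConfigSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical: append the backup observer to whatever coprocessors are already
    // configured, equivalent to editing hbase-site.xml. Key and class name are assumptions.
    String existing = conf.get(CoprocessorHost.MASTER_COPROCESSOR_CONF_KEY);
    String observerClass = "org.apache.hadoop.hbase.backup.BackupObserver";
    conf.set(CoprocessorHost.MASTER_COPROCESSOR_CONF_KEY,
      (existing == null ? "" : existing + ",") + observerClass);
  }
}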
FYI @hgromer

This kind of change does demand a mention in the release notes. We have such a field on our Jira ticket schema, and it automatically gets populated at release time.
I had actually spun up a Jira to this effect; I didn't realize there was already an open PR.
Any advice on how to progress this ticket?
Test failure looks unrelated to this PR, flaky test?
Hey @DieterDP-ng, thanks for the work here and sorry that I've been slow to respond. I will take a look here today.
rmdmattingly left a comment
    Predicate<BulkLoad> filter) throws IOException {
    try (Connection connection = ConnectionFactory.createConnection(config);
      BackupSystemTable tbl = new BackupSystemTable(connection)) {
      List<BulkLoad> bulkLoads = tbl.readBulkloadRows(List.of(tableName));
This is a concern for another time, but it's only a matter of time until the lack of pagination in BackupSystemTable#readBulkloadRows bites us
    backupManager.writeBackupStartCode(newStartCode);

    backupManager
      .deleteBulkLoadedRows(bulkLoadsToDelete.stream().map(BulkLoad::getRowKey).toList());
At my company, we have a similar patch applied to our fork, and we've run into issues with batch sizes that cause backup failures. This seems to happen when there are too many rows to delete; you end up with something like:
Caused by: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: Rejecting large batch operation for current batch with firstRegionName: backup:system_bulk,,1739970553683.c3828af81a4b3847aa0f1612bf638713. , Requested Number of Rows: 2048 , Size Threshold: 1500
at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkBatchSizeAndLogLargeSize(RSRpcServices.java:2721)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2757)
at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:43520)
at org.apache.ha
It might be worth splitting up this delete call if the batches are exceptionally large.
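For what it's worth, here is a minimal sketch of the kind of client-side chunking being suggested; the batch size, class and method names, and the use of plain Table#delete are illustrative assumptions rather than the code in this PR:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;

public class ChunkedDeleteSketch {
  // Illustrative limit, kept below the server-side rejection threshold seen in the error above.
  private static final int MAX_ROWS_PER_BATCH = 1000;

  /** Deletes the given row keys in chunks so that no single RPC exceeds the batch threshold. */
  static void deleteInChunks(Connection connection, TableName tableName, List<byte[]> rowKeys)
    throws IOException {
    try (Table table = connection.getTable(tableName)) {
      for (int start = 0; start < rowKeys.size(); start += MAX_ROWS_PER_BATCH) {
        int end = Math.min(start + MAX_ROWS_PER_BATCH, rowKeys.size());
        List<Delete> chunk = new ArrayList<>(end - start);
        for (byte[] rowKey : rowKeys.subList(start, end)) {
          chunk.add(new Delete(rowKey));
        }
        table.delete(chunk);
      }
    }
  }
}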
@hgromer do you feel that splitting up the delete is required for this patch?
@DieterDP-ng any concerns about making this change?
It might be worth splitting this up. Introducing this means we can get into a situation where backups are unable to succeed.
No objection to splitting it. Updating the PR...
Rather than doing the batching here, I did it in BackupSystemTable.deleteBulkLoadedRows, since that method is used in multiple locations, and I guess they would all benefit from batching. See the added commit (it can be squashed).
(For future reference: it turned out batching was already built in; see #6506 (comment).)
    public void deleteBulkLoadedRows(List<byte[]> rows) throws IOException {
    - try (BufferedMutator bufferedMutator = connection.getBufferedMutator(bulkLoadTableName)) {
    + try (BufferedMutator bufferedMutator = connection
    +   .getBufferedMutator(new BufferedMutatorParams(bulkLoadTableName).setMaxMutations(1000))) {
Would it be possible to set this to the config value HConstants.BATCH_ROWS_THRESHOLD_NAME, seeing as that's the cluster limit for batches?
The configuration value must then be supplied by the operator launching the backup.
This pointer led me to conclude that there's no need to specify batching at all, because it's already present.
Our method uses Connection#getBufferedMutator, for which the only non-delegating implementation is ConnectionOverAsyncConnection#getBufferedMutator. That one calls AsyncConnectionImpl#getBufferedMutatorBuilder(TableName). In that method we can see it uses an AsyncConnectionConfiguration object that has already extracted a batching value:
    AsyncConnectionConfiguration(Configuration conf) {
      ...
      this.bufferedMutatorMaxMutations = conf.getInt(BUFFERED_MUTATOR_MAX_MUTATIONS_KEY,
        conf.getInt(HConstants.BATCH_ROWS_THRESHOLD_NAME, BUFFERED_MUTATOR_MAX_MUTATIONS_DEFAULT));
    }
So I'll remove my change that introduced the batch size.
Ah, good to know. So the only thing needed is for the operator to specify the configuration on the Connection object that is passed to BackupAdminImpl.
👍
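As a rough sketch of that operator-side step, assuming BackupAdminImpl is constructed from a caller-supplied Connection and that the fallback key from the AsyncConnectionConfiguration snippet quoted above is the one to tune; the chosen value is illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.backup.impl.BackupAdminImpl;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class BackupClientConfigSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    // Keep client-side BufferedMutator batches under the server's rejection threshold.
    // Per the snippet quoted above, this key is the fallback used when no dedicated
    // buffered-mutator limit is configured.
    conf.setInt(HConstants.BATCH_ROWS_THRESHOLD_NAME, 1000);

    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      // Assumption: BackupAdminImpl picks up the batching behaviour from this Connection.
      BackupAdminImpl backupAdmin = new BackupAdminImpl(connection);
      // ... run backup and delete operations through backupAdmin as usual ...
    }
  }
}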
Retrying the build.
The failed test runs OK for me. I guess another flaky test.
    }
      (coproc == null ? "" : coproc + ",") + observerClass);

    String masterCoProc = conf.get(CoprocessorHost.MASTER_COPROCESSOR_CONF_KEY);
This probably belongs in the decorateMasterConfiguration method instead.
Do you want to make this change? If not, I'm happy to ship this PR
Updated!
I also extracted all the master-related logic from BackupObserver into BackupMasterObserver. I did this in light of HBASE-29240, where more master-specific logic is being added.
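For reference, a rough sketch of the shape this registration takes once it lives in the decorate step; the class name BackupMasterObserver comes from the comment above, while the package and the method body are illustrative assumptions rather than the exact code in this PR:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.coprocessor.CoprocessorHost;

public class MasterObserverRegistrationSketch {
  /**
   * Appends the backup master observer to the master coprocessor list, mirroring the
   * pattern quoted in the diff above. The package is assumed; only the class name is from the PR.
   */
  static void decorateMasterConfiguration(Configuration conf) {
    String observerClass = "org.apache.hadoop.hbase.backup.BackupMasterObserver";
    String coproc = conf.get(CoprocessorHost.MASTER_COPROCESSOR_CONF_KEY);
    conf.set(CoprocessorHost.MASTER_COPROCESSOR_CONF_KEY,
      (coproc == null ? "" : coproc + ",") + observerClass);
  }
}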
The HBase backup mechanism keeps track of which HFiles were bulk loaded, so they can be included in incremental backups. Before this ticket, these bulk load records were only deleted when an incremental backup was created. This commit adds two more locations: 1) after a full backup: since a full backup already captures all data, the stale records meant that unnecessary HFiles were being included in the next incremental backup; 2) after a table delete/truncate/CF-deletion: previously, if an HFile was loaded before a table was cleared, the next incremental backup would effectively still include the HFile, which led to incorrect data being restored. This commit also completely refactors and simplifies the test for this functionality.
Force-pushed from 864b9c7 to 9cf3008.
🎊 +1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
Signed-off-by: Ray Mattingly <[email protected]>
Co-authored-by: DieterDP <[email protected]>

The HBase backup mechanism keeps track of which HFiles were bulk loaded, so they can be included in incremental backups. Before this ticket, these bulk load records were only deleted when an incremental backup was created. This commit adds two more locations: after a full backup, and after a table delete/truncate/CF-deletion. It also completely refactors and simplifies the test for this functionality.
Note that this builds on top of my older PR: #6134