HBASE-29003 Proper bulk load tracking #6506
Conversation
Force-pushed from 46a5851 to fbb7080.
Force-pushed from fbb7080 to 44307bf.
Do you mind fixing this broken checkstyle while you're in there?
Force-pushed from 15fdfe7 to 01b6b87.
Fixed the checkstyle error. @ndimiduk @rmdmattingly the discussion on Jira about the best approach has stalled. Should I wait for further feedback? Personally, I think the best solution would be to track in the backup system table whether a full backup is required or not. That tracking would have to happen through an Observer, as is done in this ticket. Given that this PR solves a real issue, I'd suggest including it and leaving the switch to the tracking system for a later commit. Note that users would need to update their config to include the new observer. I'm not sure whether this counts as a migration, since the backup feature is still considered experimental, but it should be mentioned somewhere; I'm just unsure where.
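Purely as an illustration of the migration note above, the update would amount to operators appending the new observer to their coprocessor configuration, roughly as in the sketch below; the coprocessor key and observer class name used here are assumptions, since the final class and registration point were still being discussed at this stage:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.coprocessor.CoprocessorHost;

public class ObserverConfigSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical: append the backup observer to whatever coprocessors are already
    // configured, equivalent to editing hbase-site.xml. Key and class name are assumptions.
    String existing = conf.get(CoprocessorHost.MASTER_COPROCESSOR_CONF_KEY);
    String observerClass = "org.apache.hadoop.hbase.backup.BackupObserver";
    conf.set(CoprocessorHost.MASTER_COPROCESSOR_CONF_KEY,
      (existing == null ? "" : existing + ",") + observerClass);
  }
}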
FYI @hgromer

This kind of change does demand a mention in the release notes. We have such a field on our Jira ticket schema, and it automatically gets populated at release time.
I had actually spun up a Jira to this effect; I didn't realize there was already an open PR.
Any advice on how to progress this ticket?
Test failure looks unrelated to this PR, flaky test?
Hey @DieterDP-ng, thanks for the work here and sorry that I've been slow to respond. I will take a look here today.
rmdmattingly left a comment
    Predicate<BulkLoad> filter) throws IOException {
    try (Connection connection = ConnectionFactory.createConnection(config);
      BackupSystemTable tbl = new BackupSystemTable(connection)) {
      List<BulkLoad> bulkLoads = tbl.readBulkloadRows(List.of(tableName));
This is a concern for another time, but it's only a matter of time until the lack of pagination in BackupSystemTable#readBulkloadRows bites us
    backupManager.writeBackupStartCode(newStartCode);

    backupManager
      .deleteBulkLoadedRows(bulkLoadsToDelete.stream().map(BulkLoad::getRowKey).toList());
At my company, we have a similar patch applied to our fork, and we've run into issues with batch sizes that cause backup failures. This seems to happen when there are too many rows to delete; you end up with something like:
Caused by: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: Rejecting large batch operation for current batch with firstRegionName: backup:system_bulk,,1739970553683.c3828af81a4b3847aa0f1612bf638713. , Requested Number of Rows: 2048 , Size Threshold: 1500
at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkBatchSizeAndLogLargeSize(RSRpcServices.java:2721)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2757)
at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:43520)
at org.apache.ha
It might be worth splitting up this delete call if the batches are exceptionally large.
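For what it's worth, here is a minimal sketch of the kind of client-side chunking being suggested; the batch size, class and method names, and the use of plain Table#delete are illustrative assumptions rather than the code in this PR:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;

public class ChunkedDeleteSketch {
  // Illustrative limit, kept below the server-side rejection threshold seen in the error above.
  private static final int MAX_ROWS_PER_BATCH = 1000;

  /** Deletes the given row keys in chunks so that no single RPC exceeds the batch threshold. */
  static void deleteInChunks(Connection connection, TableName tableName, List<byte[]> rowKeys)
    throws IOException {
    try (Table table = connection.getTable(tableName)) {
      for (int start = 0; start < rowKeys.size(); start += MAX_ROWS_PER_BATCH) {
        int end = Math.min(start + MAX_ROWS_PER_BATCH, rowKeys.size());
        List<Delete> chunk = new ArrayList<>(end - start);
        for (byte[] rowKey : rowKeys.subList(start, end)) {
          chunk.add(new Delete(rowKey));
        }
        table.delete(chunk);
      }
    }
  }
}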
@hgromer do you feel that splitting up the delete is required for this patch?
@DieterDP-ng any concerns about making this change?
It might be worth splitting this up. Introducing this means we can get into a situation where backups are unable to succeed.
No objection to splitting it. Updating the PR...
Rather than doing the batching here, I did it in BackupSystemTable.deleteBulkLoadedRows, since that method is used in multiple locations, and I guess they would all benefit from batching. See the added commit (it can be squashed).
(For future reference: it turned out batching was already built in; see #6506 (comment).)
    public void deleteBulkLoadedRows(List<byte[]> rows) throws IOException {
    - try (BufferedMutator bufferedMutator = connection.getBufferedMutator(bulkLoadTableName)) {
    + try (BufferedMutator bufferedMutator = connection
    +   .getBufferedMutator(new BufferedMutatorParams(bulkLoadTableName).setMaxMutations(1000))) {
Would it be possible to set this to the config value HConstants.BATCH_ROWS_THRESHOLD_NAME, seeing as that's the cluster limit for batches?
The configuration value must then be supplied by the operator launching the backup.
This pointer led me to conclude that there's no need to specify batching at all, because it's already present.
Our method uses Connection#getBufferedMutator, for which the only non-delegating implementation is ConnectionOverAsyncConnection#getBufferedMutator. That one calls AsyncConnectionImpl#getBufferedMutatorBuilder(TableName). In that method we can see it uses an AsyncConnectionConfiguration object that has already extracted a batching value:
    AsyncConnectionConfiguration(Configuration conf) {
      ...
      this.bufferedMutatorMaxMutations = conf.getInt(BUFFERED_MUTATOR_MAX_MUTATIONS_KEY,
        conf.getInt(HConstants.BATCH_ROWS_THRESHOLD_NAME, BUFFERED_MUTATOR_MAX_MUTATIONS_DEFAULT));
    }
So I'll remove my change that introduced the batch size.
Ah, good to know. So the only thing needed is for the operator to specify the configuration on the Connection object that is passed to BackupAdminImpl.
👍
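As a rough sketch of that operator-side step, assuming BackupAdminImpl is constructed from a caller-supplied Connection and that the fallback key from the AsyncConnectionConfiguration snippet quoted above is the one to tune; the chosen value is illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.backup.impl.BackupAdminImpl;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class BackupClientConfigSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    // Keep client-side BufferedMutator batches under the server's rejection threshold.
    // Per the snippet quoted above, this key is the fallback used when no dedicated
    // buffered-mutator limit is configured.
    conf.setInt(HConstants.BATCH_ROWS_THRESHOLD_NAME, 1000);

    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      // Assumption: BackupAdminImpl picks up the batching behaviour from this Connection.
      BackupAdminImpl backupAdmin = new BackupAdminImpl(connection);
      // ... run backup and delete operations through backupAdmin as usual ...
    }
  }
}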
Retrying the build.
The failed test runs OK for me. I guess another flaky test.
    }
      (coproc == null ? "" : coproc + ",") + observerClass);

    String masterCoProc = conf.get(CoprocessorHost.MASTER_COPROCESSOR_CONF_KEY);
This probably belongs in the decorateMasterConfiguration method instead.
Do you want to make this change? If not, I'm happy to ship this PR
Updated!
I also extracted all the master-related logic from BackupObserver into BackupMasterObserver. I did this in light of HBASE-29240, where more master-specific logic is being added.
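For reference, a rough sketch of the shape this registration takes once it lives in the decorate step; the class name BackupMasterObserver comes from the comment above, while the package and the method body are illustrative assumptions rather than the exact code in this PR:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.coprocessor.CoprocessorHost;

public class MasterObserverRegistrationSketch {
  /**
   * Appends the backup master observer to the master coprocessor list, mirroring the
   * pattern quoted in the diff above. The package is assumed; only the class name is from the PR.
   */
  static void decorateMasterConfiguration(Configuration conf) {
    String observerClass = "org.apache.hadoop.hbase.backup.BackupMasterObserver";
    String coproc = conf.get(CoprocessorHost.MASTER_COPROCESSOR_CONF_KEY);
    conf.set(CoprocessorHost.MASTER_COPROCESSOR_CONF_KEY,
      (coproc == null ? "" : coproc + ",") + observerClass);
  }
}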
The HBase backup mechanism keeps track of which HFiles were bulk loaded, so they can be included in incremental backups. Before this ticket, these bulk load records were only deleted when an incremental backup was created. This commit adds two more locations: 1) after a full backup: since a full backup already captures all data, the stale records meant that unnecessary HFiles were being included in the next incremental backup; 2) after a table delete/truncate/CF-deletion: previously, if an HFile was loaded before a table was cleared, the next incremental backup would effectively still include the HFile, which led to incorrect data being restored. This commit also completely refactors and simplifies the test for this functionality.
Force-pushed from 864b9c7 to 9cf3008.
🎊 +1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
Signed-off-by: Ray Mattingly <[email protected]>
Co-authored-by: DieterDP <[email protected]>

The HBase backup mechanism keeps track of which HFiles were bulk loaded, so they can be included in incremental backups. Before this ticket, these bulk load records were only deleted when an incremental backup was created. This commit adds two more locations: after a full backup, and after a table delete/truncate/CF-deletion. It also completely refactors and simplifies the test for this functionality.
Note that this builds on top of my older PR: #6134