Contributor

@virajjasani virajjasani commented Nov 13, 2024

Jira: HBASE-28638

Master-initiated remote procedures are scheduled by RSProcedureDispatcher. If it encounters specific errors on the first attempt (e.g. CallQueueTooBigException or SaslException), the remote call is guaranteed not to have reached the regionserver, so the remote call is marked as failed, prompting the parent procedure to select a different target regionserver and resume the operation.
If the first attempt is successful, RSProcedureDispatcher continues with infinite retries. We can encounter valid cases (e.g. ConnectionClosedException) that halt the remote operation. Without manual intervention, this can delay the region-in-transition by several minutes or even hours.

The purpose of this Jira is to impose a retry limit for specific error types so that, once the limit is reached, the master can recover from the ongoing remote call failure by initiating an SCP (ServerCrashProcedure) on the target server. The SCP overrides the TRSP (TransitRegionStateProcedure) if required. This ensures that the target server has no region hosted online before we suspend the ongoing TRSP.

Scheduling an SCP for the target server always leads to the regionserver ending up in the stopped state: either the regionserver is stopped automatically, or, if it is still able to send a region report to the master, the master rejects the report, which in turn causes the regionserver to abort.
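
The decision flow described above can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the code in this patch: `FailFastRetrySketch`, `RetryDecision`, `decide`, and the boolean inputs are hypothetical names, and the exact comparison against the limit (`>=`) is an assumption.

```java
/** Hypothetical sketch of the retry decision described in this PR; not the RSProcedureDispatcher code. */
final class FailFastRetrySketch {

  /** Possible outcomes after a failed master -> regionserver remote call. */
  enum RetryDecision {
    FAIL_REMOTE_CALL, // parent procedure picks a different target regionserver
    SCHEDULE_SCP,     // give up on the target server and recover via ServerCrashProcedure
    RETRY             // keep retrying against the same server
  }

  /**
   * @param attemptsSoFar      number of attempts already made (0 on the first attempt)
   * @param neverReachedServer true for errors that guarantee the call did not reach the server
   * @param failFastError      true for error types subject to the retry limit
   * @param failFastLimit      assumed positive retry limit for fail-fast errors
   */
  static RetryDecision decide(int attemptsSoFar, boolean neverReachedServer,
      boolean failFastError, int failFastLimit) {
    if (attemptsSoFar == 0 && neverReachedServer) {
      return RetryDecision.FAIL_REMOTE_CALL;
    }
    if (failFastError && attemptsSoFar >= failFastLimit) {
      return RetryDecision.SCHEDULE_SCP;
    }
    return RetryDecision.RETRY;
  }

  public static void main(String[] args) {
    System.out.println(decide(0, true, false, 5));  // first attempt, call never reached the server
    System.out.println(decide(5, false, true, 5));  // fail-fast error, limit reached
    System.out.println(decide(2, false, true, 5));  // fail-fast error, still under the limit
  }
}
```

Errors outside the fail-fast set always fall through to `RETRY` in this sketch, which mirrors the existing unlimited-retry behavior.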

private void createProcedureExecutor() throws IOException {
  MasterProcedureEnv procEnv = new MasterProcedureEnv(this);
  final String procedureDispatcherClassName =
    conf.get(HBASE_MASTER_RSPROC_DISPATCHER_CLASS, DEFAULT_HBASE_MASTER_RSPROC_DISPATCHER_CLASS);
Contributor

Is this for extension of RSProcedureDispatcher, or can there be a new implementation of RemoteProcedureDispatcher?

Contributor Author

This allows only extension of RSProcedureDispatcher, mainly for testing purposes.


/**
 * Use RSProcedureDispatcher instance to initiate master -> rs remote procedure execution. Use
 * this config to provide customized RSProcedureDispatcher.
Contributor

What are we planning here @virajjasani ?

Contributor Author

@virajjasani virajjasani Nov 17, 2024

With the current initialization of the master's procedure executor, we cannot provide a setter method for a custom RSProcedureDispatcher without making significant changes.

Introducing a config here achieves the same thing more cleanly than a setter method, but the plan is for only tests to use a custom implementation of RSProcedureDispatcher.
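
As a rough illustration of the config-driven approach, the sketch below reads a class name from a configuration and instantiates it reflectively, accepting only subclasses of a base type. It uses `java.util.Properties` in place of the HBase configuration, and `DispatcherLoaderSketch`, `BaseDispatcher`, `TestDispatcher`, and the key `sketch.dispatcher.class` are all hypothetical names, not the ones in this patch.

```java
import java.util.Properties;

/** Hypothetical sketch of loading a dispatcher class named in a config; not the HMaster code. */
final class DispatcherLoaderSketch {

  /** Stand-in for the production dispatcher. */
  static class BaseDispatcher {
    String name() {
      return "base";
    }
  }

  /** Stand-in for a test-only extension of the dispatcher. */
  static class TestDispatcher extends BaseDispatcher {
    @Override
    String name() {
      return "test";
    }
  }

  /** Assumed config key, used only in this sketch. */
  static final String DISPATCHER_CLASS_KEY = "sketch.dispatcher.class";

  /** Instantiates the configured class, defaulting to BaseDispatcher when the key is unset. */
  static BaseDispatcher load(Properties conf) {
    String className = conf.getProperty(DISPATCHER_CLASS_KEY, BaseDispatcher.class.getName());
    try {
      // asSubclass rejects any class that does not extend BaseDispatcher.
      Class<? extends BaseDispatcher> clazz =
        Class.forName(className).asSubclass(BaseDispatcher.class);
      return clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalStateException("Cannot load dispatcher class " + className, e);
    }
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(load(conf).name());
    conf.setProperty(DISPATCHER_CLASS_KEY, TestDispatcher.class.getName());
    System.out.println(load(conf).name());
  }
}
```

Restricting the loaded class with `asSubclass` matches the stated intent that only extensions of the existing dispatcher are allowed, rather than arbitrary new implementations.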

/**
 * The default retry limit. Value = {@value}
 */
private static final int DEFAULT_RS_REMOTE_PROC_RETRY_LIMIT = 5;
Contributor

Why 5? Why not 3?
Let's have a comment on justification for the default value.

Contributor Author

Done.

 * @param e IOException thrown by the underlying rpc framework.
 * @return True if the error is eligible for imposing retry limit.
 */
private boolean isErrorTypeEligibleForRetryLimit(IOException e) {
Contributor

Instead of "eligible for retry limit" type wording, you should describe this as "fast fail" in method naming and comment text.
Ok, and then following this recommendation, the config hbase.master.rs.remote.proc.retry.limit might be better named hbase.master.rs.remote.proc.fast.fail.limit

Contributor Author

Done

Contributor

Apache9 commented Nov 15, 2024

Why do we need a new implementation of RSProcedureDispatcher?

I think here we can introduce a retry limit: by default it is -1, which means infinite retries and keeps the old behavior, and if it is set to a positive value, we schedule an SCP when the limit is reached?
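
A minimal sketch of the sentinel semantics proposed in this comment, assuming a non-positive limit means "retry forever". `RetryLimitSketch`, `UNLIMITED`, and `limitReached` are hypothetical names, and treating 0 the same as -1 is an assumption the comment does not address.

```java
/** Hypothetical sketch of a retry limit where -1 keeps the old unlimited behavior. */
final class RetryLimitSketch {

  /** Proposed default: retry indefinitely. */
  static final int UNLIMITED = -1;

  /** Returns true only when a positive limit is configured and has been reached. */
  static boolean limitReached(int attemptsSoFar, int retryLimit) {
    return retryLimit > 0 && attemptsSoFar >= retryLimit;
  }

  public static void main(String[] args) {
    System.out.println(limitReached(1_000_000, UNLIMITED)); // unlimited: never reached
    System.out.println(limitReached(3, 3));                 // positive limit reached
  }
}
```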

Contributor Author

virajjasani commented Nov 17, 2024

Why do we need a new implementation of RSProcedureDispatcher?

It is done only for testing purposes. There is no new implementation as part of the source code.

I think here we can introduce a retry limit

Yes, it is done with hbase.master.rs.remote.proc.fail.fast.limit config.

/**
 * Test implementation of RSProcedureDispatcher that throws desired errors for testing purpose.
 */
public class RSProcDispatcher extends RSProcedureDispatcher {
Contributor Author

This extension of RSProcedureDispatcher is under hbase-server/src/test package.
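
The general idea of a test-only subclass that injects failures can be sketched as below. This is not the RSProcDispatcher test class from the patch; `FaultInjectionSketch`, `Dispatcher`, `FailingDispatcher`, `send`, and `attemptsUntilSuccess` are hypothetical stand-ins chosen for illustration.

```java
import java.io.IOException;

/** Hypothetical sketch of fault injection through a test-only subclass. */
final class FaultInjectionSketch {

  /** Stand-in for the production dispatcher. */
  static class Dispatcher {
    String send(String request) throws IOException {
      return "ok:" + request;
    }
  }

  /** Test-only extension that fails a fixed number of calls before delegating. */
  static class FailingDispatcher extends Dispatcher {
    private int remainingFailures;

    FailingDispatcher(int failures) {
      this.remainingFailures = failures;
    }

    @Override
    String send(String request) throws IOException {
      if (remainingFailures > 0) {
        remainingFailures--;
        throw new IOException("injected failure");
      }
      return super.send(request);
    }
  }

  /** Returns the attempt number that succeeded, or -1 if maxAttempts was exhausted. */
  static int attemptsUntilSuccess(Dispatcher dispatcher, String request, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        dispatcher.send(request);
        return attempt;
      } catch (IOException e) {
        // Injected failure: fall through and try again.
      }
    }
    return -1;
  }
}
```

A test can then assert on the observed number of attempts, for example that two injected failures lead to success on the third attempt.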

@virajjasani virajjasani changed the title from "HBASE-28638 Impose retry limit for specific errors to recover from remote procedure failure using server crash" to "HBASE-28638 Fail-fast retry limit for specific errors to recover from remote procedure failure using server crash" on Nov 17, 2024
if (numberOfAttemptsSoFar == 0 && unableToConnectToServer(e)) {
  return false;
}
ExecuteProceduresRequest executeProceduresRequest = request.build();
Contributor

Seems we do not need this change?

}
Throwable cause = e;
while (true) {
  if (cause instanceof IOException) {
Contributor

I think we can only unwrap RemoteException, so here let's just test RemoteException directly?

Contributor Author

RemoteException is of type IOE and I kept this to be consistent with other exception check logic we have.

Contributor Author

Is this fine @Apache9? I can change it if you have strong opinion on this.

Contributor

OK, I see you have already used this style in isSaslError; I should have stopped it in the first place...

Let's file new issues to polish it. UnwrapException is not designed to be used here. It just wants to put the instanceof test inside the method so we do not need to write it everywhere, but here we already have the instanceof, so calling this method again does not make sense...

Contributor Author

Sure, isConnectionClosedError() also follows the same pattern. Logically there is no difference, but your suggestion will make it look cleaner, so let's do it in a separate Jira?
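
The cause-chain walk that these checks share can be sketched generically as below. This is an illustration, not the patch's isSaslError or isConnectionClosedError code: `CauseChainSketch`, `hasCause`, and `MAX_DEPTH` are hypothetical names, and the RemoteException unwrapping discussed in this thread is deliberately left out.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Hypothetical sketch of walking an exception's cause chain to find a given type. */
final class CauseChainSketch {

  /** Assumed guard against cyclic or very deep cause chains. */
  private static final int MAX_DEPTH = 32;

  /** Returns true if the throwable or any of its causes is an instance of target. */
  static boolean hasCause(Throwable e, Class<? extends Throwable> target) {
    Throwable cause = e;
    for (int depth = 0; cause != null && depth < MAX_DEPTH; depth++) {
      if (target.isInstance(cause)) {
        return true;
      }
      cause = cause.getCause();
    }
    return false;
  }

  public static void main(String[] args) {
    IOException wrapped = new IOException(new ConnectException("connection refused"));
    System.out.println(hasCause(wrapped, ConnectException.class));
    System.out.println(hasCause(new IOException("plain"), ConnectException.class));
  }
}
```

Bounding the loop by a null check and a depth limit avoids the open-ended `while (true)` shape while keeping the same traversal.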

@virajjasani virajjasani requested a review from Apache9 December 19, 2024 03:24
@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:--------|
| +0 🆗 | reexec | 0m 36s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 1s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | hbaseanti | 0m 0s | | Patch does not have any anti-patterns. |
| | | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 3m 2s | | master passed |
| +1 💚 | compile | 2m 40s | | master passed |
| +1 💚 | checkstyle | 0m 36s | | master passed |
| +1 💚 | spotbugs | 1m 34s | | master passed |
| +1 💚 | spotless | 0m 39s | | branch has no errors when running spotless:check. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 2m 52s | | the patch passed |
| +1 💚 | compile | 3m 5s | | the patch passed |
| +1 💚 | javac | 3m 5s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 34s | | the patch passed |
| +1 💚 | spotbugs | 1m 34s | | the patch passed |
| +1 💚 | hadoopcheck | 9m 44s | | Patch does not cause any errors with Hadoop 3.3.6 3.4.0. |
| +1 💚 | spotless | 0m 39s | | patch has no errors when running spotless:check. |
| | | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 10s | | The patch does not generate ASF License warnings. |
| | | 33m 56s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6462/8/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #6462 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 6971b5a7933a 5.4.0-200-generic #220-Ubuntu SMP Fri Sep 27 13:19:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f8fd6b9 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Max. process+thread count | 85 (vs. ulimit of 30000) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6462/8/console |
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:--------|
| +0 🆗 | reexec | 0m 26s | | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 3s | | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
| | | | | _ Prechecks _ |
| | | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 3m 4s | | master passed |
| +1 💚 | compile | 0m 58s | | master passed |
| +1 💚 | javadoc | 0m 27s | | master passed |
| +1 💚 | shadedjars | 5m 40s | | branch has no errors when building our shaded downstream artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 2m 54s | | the patch passed |
| +1 💚 | compile | 0m 54s | | the patch passed |
| +1 💚 | javac | 0m 54s | | the patch passed |
| +1 💚 | javadoc | 0m 26s | | the patch passed |
| +1 💚 | shadedjars | 5m 37s | | patch has no errors when building our shaded downstream artifacts. |
| | | | | _ Other Tests _ |
| -1 ❌ | unit | 179m 33s | /patch-unit-hbase-server.txt | hbase-server in the patch failed. |
| | | 204m 4s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6462/8/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile |
| GITHUB PR | #6462 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux a683c22e0d9d 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f8fd6b9 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6462/8/testReport/ |
| Max. process+thread count | 5137 (vs. ulimit of 30000) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6462/8/console |
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@virajjasani virajjasani merged commit 3748204 into apache:master Jan 3, 2025
1 check failed
virajjasani added a commit that referenced this pull request Jan 3, 2025
… remote procedure failure using server crash (#6462)

Signed-off-by: Duo Zhang <[email protected]>
virajjasani added a commit that referenced this pull request Jan 3, 2025
… remote procedure failure using server crash (#6564) (#6462)

Signed-off-by: Duo Zhang <[email protected]>
ragarkar pushed a commit to ragarkar/hbase that referenced this pull request Jan 3, 2025
… remote procedure failure using server crash (apache#6462)

Signed-off-by: Duo Zhang <[email protected]>
mokai87 pushed a commit to mokai87/hbase that referenced this pull request Aug 7, 2025
… remote procedure failure using server crash (apache#6564) (apache#6462)

Signed-off-by: Duo Zhang <[email protected]>
sanjeet006py pushed a commit to sanjeet006py/hbase that referenced this pull request Aug 24, 2025
… remote procedure failure using server crash (apache#6564) (apache#6462)

Signed-off-by: Duo Zhang <[email protected]>
sanjeet006py pushed a commit to sanjeet006py/hbase that referenced this pull request Sep 26, 2025
… remote procedure failure using server crash (apache#6564) (apache#6462)

Signed-off-by: Duo Zhang <[email protected]>