-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-28638 Fail-fast retry limit for specific errors to recover from remote procedure failure using server crash #6462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…mote procedure failure using server crash
617c7c3 to
a55b88f
Compare
| private void createProcedureExecutor() throws IOException { | ||
| MasterProcedureEnv procEnv = new MasterProcedureEnv(this); | ||
| final String procedureDispatcherClassName = | ||
| conf.get(HBASE_MASTER_RSPROC_DISPATCHER_CLASS, DEFAULT_HBASE_MASTER_RSPROC_DISPATCHER_CLASS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for extention of RSProcedureDispatcher or can there be a new implementation of RemoteProcedureDispatcher ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows only extension of RSProcedureDispatcher, mainly to be used for testing purpose.
|
|
||
| /** | ||
| * Use RSProcedureDispatcher instance to initiate master -> rs remote procedure execution. Use | ||
| * this config to provide customized RSProcedureDispatcher. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we planning here @virajjasani ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current master's proc executor initialization, we are not able to provide a setter method for custom RSProcedureDispatcher without making significant changes.
Introducing config here to achieve the same is cleaner than setter method but the plan is only for test to use the custom implementation of RSProcedureDispatcher.
| /** | ||
| * The default retry limit. Value = {@value} | ||
| */ | ||
| private static final int DEFAULT_RS_REMOTE_PROC_RETRY_LIMIT = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 5? Why not 3?
Let's have a comment on justification for the default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| * @param e IOException thrown by the underlying rpc framework. | ||
| * @return True if the error is eligible for imposing retry limit. | ||
| */ | ||
| private boolean isErrorTypeEligibleForRetryLimit(IOException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of "eligible for retry limit" type wording, you should describe this as "fast fail" in method naming and comment text.
Ok, and then following this recommendation, the config hbase.master.rs.remote.proc.retry.limit might be better named hbase.master.rs.remote.proc.fast.fail.limit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Why do we need a new implementation of RSProcedureDispatcher? I think here we can introduce a retry limit, by default it is -1, which means infinite retry, which keeps the old behavior, and if set to a positive value, we will schedule SCP when reaching the limit? |
It is done only for testing purpose. There is no new implementation as part of the source code.
Yes, it is done with |
| /** | ||
| * Test implementation of RSProcedureDispatcher that throws desired errors for testing purpose. | ||
| */ | ||
| public class RSProcDispatcher extends RSProcedureDispatcher { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This extension of RSProcedureDispatcher is under hbase-server/src/test package.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| if (numberOfAttemptsSoFar == 0 && unableToConnectToServer(e)) { | ||
| return false; | ||
| } | ||
| ExecuteProceduresRequest executeProceduresRequest = request.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we do not need this change?
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
Show resolved
Hide resolved
| } | ||
| Throwable cause = e; | ||
| while (true) { | ||
| if (cause instanceof IOException) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can only unwrap RemoteException, so here let's just test RemoteException directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RemoteException is of type IOE and I kept this to be consistent with other exception check logic we have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this fine @Apache9? I can change it if you have strong opinion on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I saw you have already used this style in isSaslError, I should stop it at the first place...
Let's file new issues to polish it. UnwrapException is not designed to be used here, it just want to put the instanceof test inside the method so we do not need to write it everywhere but here we already have the instanceof, so calling this method again does not make sense...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, isConnectionClosedError() also follows the same pattern. Logically there is no difference but your suggestion will make it look cleaner so let's do it in separate Jira?
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
… remote procedure failure using server crash (#6462) Signed-off-by: Duo Zhang <[email protected]>
… remote procedure failure using server crash (#6564) (#6462) Signed-off-by: Duo Zhang <[email protected]>
… remote procedure failure using server crash (#6564) (#6462) Signed-off-by: Duo Zhang <[email protected]>
… remote procedure failure using server crash (#6564) (#6462) Signed-off-by: Duo Zhang <[email protected]>
… remote procedure failure using server crash (apache#6462) Signed-off-by: Duo Zhang <[email protected]>
… remote procedure failure using server crash (apache#6564) (apache#6462) Signed-off-by: Duo Zhang <[email protected]>
… remote procedure failure using server crash (apache#6564) (apache#6462) Signed-off-by: Duo Zhang <[email protected]>
… remote procedure failure using server crash (apache#6564) (apache#6462) Signed-off-by: Duo Zhang <[email protected]>
Jira: HBASE-28638
Master initiated remote procedures are scheduled by RSProcedureDispatcher. If it encounters specific errors on first retry (e.g. CallQueueTooBigException or SaslException), it is guaranteed that the remote call has not reached the regionserver, therefore the remote call is marked failed prompting the parent procedure to select different target regionserver to resume the operation.
If the first attempt is successful, RSProcedureDispatcher continues with infinite retries. We can encounter valid case (e.g. ConnectionClosedException) which is halting the remote operation. Without manual intervention, it can cause significant delay upto several minutes or hours to the region-in-transition.
The purpose of this Jira is to impose retry limit for specific error types such that if the retry limit is reached, the master can recover the state of the ongoing remote call failure by initiating SCP (ServerCrashProcedure) on the target server. The SCP is going to override the TRSP (TransitRegionStateProcedure) if required. This can ensure that the target server has no region hosted online before we suspend the ongoing TRSP.
Scheduling SCP for the target server will always lead to the regionserver in stopped state. Either regionserver would be automatically stopped, or if the regionserver is able to send the region report to master, master will reject it, which will further lead to regionserver abort.