HDFS-17821. Fix the SNN repeatedly checkpoint after fsimage transfer failure on one of the multiple NNs #7876
🎊 +1 overall
This message was automatically generated.
@tomscut @Hexiaoqiao Hi, could you help review this issue?
We verified this change in production and it worked fine. The change looks good to me.
💔 -1 overall
This message was automatically generated.
The unit test failure was not related to the change.
@lfxy Thanks for your contribution!
In our cluster with observer NNs, when the standby NN (SNN) performs a checkpoint and sends the fsimage to the other NNs, the transfer to one NN may fail because of a network anomaly, an NN restart, or another exception. The standby then treats the whole checkpoint as failed, does not update its lastCheckpointTime, and retries the checkpoint.
However, the active or observer NNs that received the fsimage successfully have already updated their lastCheckpointTime, while the NN whose transfer failed has not, so lastCheckpointTime becomes inconsistent across the NNs. Subsequent checkpoints then repeatedly fail to send the fsimage to some or all of the active and observer NNs, because those NNs do not yet satisfy the DFS_NAMENODE_CHECKPOINT_PERIOD_KEY condition.
As a result, the SNN keeps failing the checkpoint and retrying. I think the SNN should consider the checkpoint successful and update its lastCheckpointTime if the fsimage transfer succeeds on at least half of the NNs.
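The behaviour described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code in this patch: `CheckpointSketch`, `RemoteNN`, `receiveImage`, `uploadToAll`, and `checkpointSucceeded` are hypothetical names, and the receiver-side check is reduced to a time-only comparison against the checkpoint period.

```java
import java.util.List;

/** Hypothetical model of the scenario; these are not the Hadoop classes. */
final class CheckpointSketch {

  /** A remote NN (active or observer) that receives the fsimage. */
  static final class RemoteNN {
    long lastCheckpointTimeMs;  // updated only when an upload is accepted
    boolean reachable = true;   // false simulates a network fault or restart

    RemoteNN(long lastCheckpointTimeMs) {
      this.lastCheckpointTimeMs = lastCheckpointTimeMs;
    }

    /**
     * Simplified acceptance check (assumption): reject the upload when the
     * NN is unreachable or its previous image is newer than one period.
     */
    boolean receiveImage(long nowMs, long periodMs) {
      if (!reachable || nowMs - lastCheckpointTimeMs < periodMs) {
        return false;
      }
      lastCheckpointTimeMs = nowMs;
      return true;
    }
  }

  /** Uploads to every NN and returns the number of accepted transfers. */
  static int uploadToAll(List<RemoteNN> nns, long nowMs, long periodMs) {
    int accepted = 0;
    for (RemoteNN nn : nns) {
      if (nn.receiveImage(nowMs, periodMs)) {
        accepted++;
      }
    }
    return accepted;
  }

  /** Proposed rule: the checkpoint succeeds if at least half the uploads did. */
  static boolean checkpointSucceeded(int successes, int total) {
    return total > 0 && 2 * successes >= total;
  }
}
```

In this model, if one of two NNs is unreachable, an immediate retry is rejected by the NN that already accepted the image, which is the repeated failure described above. With the at-least-half rule the standby would instead record the checkpoint as successful and wait a full period before the next attempt.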