fix(chain): improve chain scheduler robustness and eth tx monitoring (synapse) #553
Conversation
Caught one with this PR:
And in Lotus:
But, now that tx is in perpetual pending mode:
But it's on chain: https://calibration.filfox.info/en/message/0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83?t=1
Lotus just doesn't know about it:
But on my Calibnet node:
So this looks kind of like a ChainIndexer bug as well.
Oh, actually, that tx is from
So, a couple of problems here:
I'll observe this a little longer and then tidy it up for potential merge.
Force-pushed from 430f43a to ee0ef4d
- Add mutex protection and notification timeout handling to CurioChainSched
- Implement timeout protection for eth client calls to prevent indefinite blocking
- Add comprehensive test coverage for chain scheduler and eth message watcher
- Extract interfaces (EthClient, TaskEngine, EthTransactionManager) for testability (a rough sketch of one such interface follows below)
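As an illustration of that last bullet, here is a minimal sketch of the kind of interface extraction described, assuming go-ethereum's ethclient is the concrete client; the method set actually extracted in the PR may differ:

    package message

    import (
        "context"

        "github.com/ethereum/go-ethereum/common"
        "github.com/ethereum/go-ethereum/core/types"
        "github.com/ethereum/go-ethereum/ethclient"
    )

    // EthClient captures only the ethclient.Client methods the eth message watcher
    // needs, so tests can substitute a fake that returns canned receipts.
    // (Hypothetical method set; the interface in the PR may differ.)
    type EthClient interface {
        TransactionReceipt(ctx context.Context, txHash common.Hash) (*types.Receipt, error)
        TransactionByHash(ctx context.Context, txHash common.Hash) (*types.Transaction, bool, error)
        BlockNumber(ctx context.Context) (uint64, error)
    }

    // Compile-time check that the real client still satisfies the interface.
    var _ EthClient = (*ethclient.Client)(nil)

With an interface like this in place, the watcher loop can take an EthClient rather than a *ethclient.Client, and tests can drive it with a stub.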
Force-pushed from ee0ef4d to 9c8eafd
This is now the same as #552 but for the
* ServiceAuth and NullAuth (#512)
  * ServiceAuth and NullAuth
  * fix
  Co-authored-by: zenground0 <[email protected]>
* Fix/public service sql mig (#516)
  * Add sql migration function to enable public service with dummy key
  * Fix typo
  Co-authored-by: zenground0 <[email protected]>
* Tigher busy waiting (#521)
  Co-authored-by: zenground0 <[email protected]>
* fix (#523)
  Co-authored-by: zenground0 <[email protected]>
* Look in pdp piecerefs to check for actual readiness (#525)
  * Look in pdp piecerefs to check for actual readiness
  * fix
  Co-authored-by: zenground0 <[email protected]>
* YOLO (#526)
  Co-authored-by: zenground0 <[email protected]>
* Remove unpadding transformation from root size (#529)
  Co-authored-by: zenground0 <[email protected]>
* feat(pdp): add transaction tracking for root additions (synapse branch) (#530)
  * feat(pdp): add transaction tracking for root additions
    - Add Location header to root addition response with tx hash
    - Add GET /proof-sets/{id}/roots/added/{txHash} endpoint for status checks
    - Implement handleGetProofSetRoot for retrieving root details
    - Return confirmed root IDs after transaction confirmation
  * fix(pdp): use lowercase transaction hashes consistently in database operations
  * fix(pdp): retain root addition records to prevent 404s on status queries
    Previously, pdp_proofset_root_adds records were deleted after processing, causing clients to receive 404 errors when querying transaction status after successful root additions. This created a poor UX as clients couldn't verify transaction outcomes if they queried too late. This change aligns root addition tracking with proof set creation by:
    - Adding roots_added column to mark processed records instead of deleting
    - Updating watcher to UPDATE instead of DELETE when processing completes
    - Including roots_added status in API response for client visibility
    The fix ensures transaction status remains queryable indefinitely while maintaining backward compatibility. Existing records will be processed normally with roots_added=FALSE by default.
    API enhancement: /pdp/proof-sets/{id}/roots/added/{txHash} now includes "rootsAdded" field indicating if roots were fully processed.
  * fix(pdp): extra checks to make sure the db entries are consistent
  * fix(pdp): address review feedback
* chore(pdp): add more logging; don't bail early on tx error (#549)
  * chore(pdp): add transaction processing logging
  * chore(pdp): tweak logging statements
* fix(chain): improve chain scheduler robustness and eth tx monitoring (#553)
  - Add mutex protection and notification timeout handling to CurioChainSched
  - Implement timeout protection for eth client calls to prevent indefinite blocking
  - Add comprehensive test coverage for chain scheduler and eth message watcher
  - Extract interfaces (EthClient, TaskEngine, EthTransactionManager) for testability
* chore(logs): log proof data as hex (#551)

Co-authored-by: ZenGround0 <[email protected]>
Co-authored-by: zenground0 <[email protected]>
Co-authored-by: Rod Vagg <[email protected]>
- similar fix to change in filecoin-project/curio#553
    // Notification timeout for chain updates, if we don't get a notification within this time frame
    // then something must be wrong so we'll attempt to restart
    const notificationTimeout = 60 * time.Second
In the event of null round(s), might this timeout be too aggressive?
e.g. with logging added:
2025-08-01T16:12:30.058Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891279"}
2025-08-01T16:12:30.059Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.001100011, "apply": "2891279"}
2025-08-01T16:13:36.247Z WARN scheduler/chain chainsched/chain.go:159 no notifications received in 1m0s, resubscribing to ChainNotify
2025-08-01T16:13:36.248Z INFO scheduler/chain chainsched/chain.go:94 restarting Scheduler with new notification channel
2025-08-01T16:13:36.249Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891279"}
2025-08-01T16:13:36.250Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.001087254, "apply": "2891279"}
2025-08-01T16:13:59.639Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891282"}
2025-08-01T16:13:59.640Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.000919186, "apply": "2891282"}
2025-08-01T16:14:04.084Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891282"}
2025-08-01T16:14:04.085Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.000956019, "apply": "2891282"}
yes, we probably should be more conservative here, I'll open a PR.
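Purely as an illustration of "more conservative" (not necessarily what the follow-up PR does): with 30-second epochs, two consecutive null rounds already blow past a 60s window, so one option is to size the timeout in epochs. The package name and the chosen multiple below are illustrative only:

    package chainsched // hypothetical package name, for illustration

    import "time"

    const (
        epochDuration       = 30 * time.Second   // Filecoin epoch time
        notificationTimeout = 10 * epochDuration // tolerate several null rounds back to back
    )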
| log.Debugf("CurioChainSched ticker: %s since last notification", since) | ||
| if since > notificationTimeout { | ||
| log.Warnf("no notifications received in %s, resubscribing to ChainNotify", notificationTimeout) | ||
| notificationCh = nil |
In the event this case occurs, I believe the context passed to ChainNotify will need to be canceled, else the goroutine in lotus will continue to send notifications to the subscription channel, eventually resulting in warnings from go-jsonrpc like:
2025-08-01T16:17:04.386Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 11}
Then, per my other comment above, when null rounds occur again we'll create another subscription, resulting in more buffering in go-jsonrpc, e.g.:
2025-08-01T17:11:03.778Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 16}
2025-08-01T17:11:03.778Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 14}
2025-08-01T17:11:03.780Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 113}
2025-08-01T17:11:03.780Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 143}
2025-08-01T17:11:03.780Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891396"}
2025-08-01T17:11:03.781Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.000847668, "apply": "2891396"}
2025-08-01T17:12:36.247Z WARN scheduler/chain chainsched/chain.go:159 no notifications received in 1m0s, resubscribing to ChainNotify
2025-08-01T17:12:36.248Z INFO scheduler/chain chainsched/chain.go:94 restarting Scheduler with new notification channel
2025-08-01T17:12:36.249Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891396"}
2025-08-01T17:12:36.250Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.00077717, "apply": "2891396"}
2025-08-01T17:14:00.213Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 17}
2025-08-01T17:14:00.213Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 15}
2025-08-01T17:14:00.213Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 114}
2025-08-01T17:14:00.213Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 144}
    notifs, err = s.api.ChainNotify(ctx)
    for ctx.Err() == nil {
        if notificationCh == nil {
            notificationCh, err = s.api.ChainNotify(ctx)
I think we need to cancel this context, or use a dedicated one for cleanup of subscription channels.
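A minimal sketch of what that could look like, assuming a ChainNotify-style API (the types and names below are illustrative stand-ins, not the actual curio/lotus types): each subscription gets its own cancellable context, and the watchdog cancels it before resubscribing, so the server-side sender is torn down instead of buffering into an abandoned channel.

    package main

    import (
        "context"
        "log"
        "time"
    )

    // headChange is a stand-in for the lotus api.HeadChange type.
    type headChange struct{}

    // chainAPI is a stand-in for the subset of the full node API the scheduler uses.
    type chainAPI interface {
        ChainNotify(ctx context.Context) (<-chan []*headChange, error)
    }

    const notificationTimeout = 60 * time.Second

    func run(ctx context.Context, api chainAPI, handle func([]*headChange)) {
        var (
            notifCh      <-chan []*headChange
            cancelNotify context.CancelFunc = func() {}
            lastNotif                       = time.Now()
        )
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        defer func() { cancelNotify() }() // release whichever subscription is live on exit

        for ctx.Err() == nil {
            if notifCh == nil {
                // Dedicated context per subscription so it can be torn down on resubscribe.
                subCtx, cancel := context.WithCancel(ctx)
                ch, err := api.ChainNotify(subCtx)
                if err != nil {
                    cancel()
                    log.Printf("ChainNotify failed, retrying: %v", err)
                    time.Sleep(time.Second)
                    continue
                }
                notifCh, cancelNotify, lastNotif = ch, cancel, time.Now()
            }

            select {
            case <-ctx.Done():
                return
            case changes, ok := <-notifCh:
                if !ok { // server closed the channel; clean up and resubscribe
                    cancelNotify()
                    notifCh = nil
                    continue
                }
                lastNotif = time.Now()
                handle(changes)
            case <-ticker.C:
                if time.Since(lastNotif) > notificationTimeout {
                    log.Printf("no notifications in %s, resubscribing to ChainNotify", notificationTimeout)
                    cancelNotify() // cancel the old subscription before abandoning its channel
                    notifCh = nil
                }
            }
        }
    }

    func main() {}

The key detail is that the old subscription's cancel runs before notifCh is set to nil, so the server-side goroutine and the go-jsonrpc buffer should be released rather than keep filling up behind an abandoned channel.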
(Edited after confirming this works, cleaning up and adding tests)
Original:
Still failing in watch_eth.go, but I have a little bit more information with additional logging turned on.
Last output for watch_eth.go working on stuff is:
After that we just have this:
Which looks to me like a lock-up here since either return path should log something:
curio/tasks/message/watch_eth.go, line 114 at 289b2db
At the same time, Lotus shows us:
That path suggests that we hit the moment in the Lotus ChainIndexer where it had a new chain head that wasn't quite indexed yet, so it had to wait a little while for it to be indexed. That should be pretty quick, but there's a 5 second timeout on the wait. That particular timeout isn't what's returned here though; it's a timeout from somewhere up the stack, I believe go-jsonrpc, maybe a deadline in the websocket handler, although I can't figure out exactly where. It might be the generic 30 second timeout that go-jsonrpc sets up for listening for responses, but that seems far too long to be relevant here. The timing says Curio asked Lotus at 13:11:00.928 for an eth_getTransactionReceipt and Lotus experienced a timeout error at 13:11:00.929! The fact that we don't progress further than this might suggest that go-ethereum/ethclient doesn't know how to deal with the error that comes out of Lotus for this event, or maybe one of the websocket handlers (on either side) doesn't. So my strategy here is to put a 1s timeout per transaction in the processing loop and move on if we don't get a response after that. Hopefully whatever part of the stack is pausing will respond to a context kick.
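A minimal sketch of that per-transaction timeout, assuming go-ethereum's ethclient is what the watcher calls through (the real watch_eth.go loop carries more bookkeeping; receiptTimeout, checkReceipts and the logging here are illustrative):

    package main

    import (
        "errors"
        "context"
        "log"
        "time"

        ethereum "github.com/ethereum/go-ethereum"
        "github.com/ethereum/go-ethereum/common"
        "github.com/ethereum/go-ethereum/ethclient"
    )

    const receiptTimeout = time.Second

    // checkReceipts polls each pending tx once; a stuck RPC only costs receiptTimeout.
    func checkReceipts(ctx context.Context, client *ethclient.Client, hashes []common.Hash) {
        for _, h := range hashes {
            // Per-transaction deadline so one wedged eth_getTransactionReceipt
            // call can't stall the whole processing loop.
            callCtx, cancel := context.WithTimeout(ctx, receiptTimeout)
            receipt, err := client.TransactionReceipt(callCtx, h)
            timedOut := callCtx.Err() != nil
            cancel()

            switch {
            case err == nil:
                log.Printf("tx %s confirmed in block %d", h, receipt.BlockNumber.Uint64())
            case errors.Is(err, ethereum.NotFound):
                // still pending, check again next round
            case timedOut:
                log.Printf("receipt lookup for %s hit the %s deadline, moving on", h, receiptTimeout)
            default:
                log.Printf("receipt lookup for %s failed: %v", h, err)
            }
        }
    }

    func main() {}

If the layer that is pausing honors context cancellation, the wedged call unblocks after a second and the loop keeps making progress on the remaining transactions instead of locking up.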