fix(chain): improve chain scheduler robustness and eth tx monitoring (synapse) #553

rvagg · 2025-07-01T03:47:39Z

Add mutex protection and notification timeout handling to CurioChainSched
Implement timeout protection for eth client calls to prevent indefinite blocking
Add comprehensive test coverage for chain scheduler and eth message watcher
Extract interfaces (EthClient, TaskEngine, EthTransactionManager) for testability

(Edited after confirming this works, cleaning up and adding tests)

Original:

Still failing in watch_eth.go, but I have a little bit more information with additional logging turned on.

Last output for watch_eth.go working on stuff is:

2025-06-30T13:11:00.927Z        INFO    curio/message   message/watch_eth.go:104        Processing pending transactions {"count": 12, "machineID": 1}
2025-06-30T13:11:00.927Z        DEBUG   curio/message   message/watch_eth.go:112        Checking transaction    {"txHash": "0x0ee2d4acfb636e314a0a544a84e60873aebaa39c607f500e5ac0b4d009953f6e"}
2025-06-30T13:11:00.928Z        DEBUG   curio/message   message/watch_eth.go:119        Transaction still pending       {"txHash": "0x0ee2d4acfb636e314a0a544a84e60873aebaa39c607f500e5ac0b4d009953f6e"}
2025-06-30T13:11:00.928Z        DEBUG   curio/message   message/watch_eth.go:112        Checking transaction    {"txHash": "0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4"}

After that we just have this:

2025-07-01T01:54:30.072Z        DEBUG   curio/message   message/watch_eth.go:246        Head change received    {"newHeight": "2800283"}
2025-07-01T01:54:30.072Z        DEBUG   curio/message   message/watch_eth.go:251        Update channel full, skipping   {"height": "2800283"}
2025-07-01T01:55:00.080Z        DEBUG   curio/message   message/watch_eth.go:246        Head change received    {"newHeight": "2800284"}
2025-07-01T01:55:00.080Z        DEBUG   curio/message   message/watch_eth.go:251        Update channel full, skipping   {"height": "2800284"}

Which looks to me like a lock-up here since either return path should log something:

curio/tasks/message/watch_eth.go

Line 114 in 289b2db

receipt, err := mw.api.TransactionReceipt(ctx, txHash)

At the same time, Lotus shows us:

{"level":"error","ts":"2025-06-30T13:11:00.929Z","logger":"node/eth","caller":"eth/transaction.go:414","msg":"failed to lookup transaction hash 0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4 in chain indexer: failed to read data from index: context canceled"}
{"level":"warn","ts":"2025-06-30T13:11:00.929Z","logger":"rpc","caller":"[email protected]/handler.go:421","msg":"error in RPC call to 'eth_getTransactionReceipt': failed to lookup transaction hash 0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4 in chain indexer:\n    github.com/filecoin-project/lotus/node/impl/eth.(*ethTransaction).getCidForTransaction\n        /root/repos/lotus/node/impl/eth/transaction.go:415\n  - failed to read data from index:\n    github.com/filecoin-project/lotus/chain/index.(*SqliteIndexer).readWithHeadIndexWait\n        /root/repos/lotus/chain/index/read.go:88\n  - context canceled"}
{"level":"error","ts":"2025-06-30T13:11:00.929Z","logger":"rpc","caller":"[email protected]/websocket.go:145","msg":"handle me:write tcp4 127.0.0.1:1234->127.0.0.1:37802: use of closed network connection"}

That path suggests that we hit the moment in the Lotus ChainIndexer where it had a new chain head but it wasn't quite indexed, so it had to wait a little while to see it be indexed, which should be pretty quick but there's a 5 second timeout on the wait. That particular timeout isn't what's returned here though, it's a timeout from somewhere up the stack, I believe go-jsonrpc, maybe a deadline in the websocket handler, although I can't figure out exactly where. I think it might be the generic timeout that go-jsonrpc sets up for listening for responses which is 30 seconds, but that seems far too long to be relevant here. The timing says Curio asked Lotus at 13:11:00.928 for an eth_getTransactionReceipt and Lotus experienced a timeout error at 13:11:00.929!

The fact that we don't progress further than this might suggest that go-ethereum/ethclient doesn't know how to deal with the error that comes out of Lotus for this event, or maybe one of the websocket handlers (on either side) doesn't know. So my strategy here is to put a 1s timeout per transaction in the processing loop and move on if we don't get a response after that. Hopefully whatever part of the stack is pausing will respond to a context kick.

rvagg · 2025-07-02T06:10:27Z

Caught one with this PR:

2025-07-01T12:03:00.264Z        DEBUG   curio/message   message/watch_eth.go:119        Checking transaction    {"txHash": "0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"}
2025-07-01T12:03:01.264Z        DEBUG   curio/message   message/watch_eth.go:134        Eth calls timed out - continuing with next tx   {"txHash": "0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"}

And in Lotus:

{"level":"error","ts":"2025-07-01T12:03:00.264Z","logger":"node/eth","caller":"eth/transaction.go:414","msg":"failed to lookup transaction hash 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 in chain indexer: failed to read data from index: context canceled"}
{"level":"warn","ts":"2025-07-01T12:03:00.264Z","logger":"rpc","caller":"[email protected]/handler.go:421","msg":"error in RPC call to 'eth_getTransactionReceipt': failed to lookup transaction hash 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 in chain indexer:\n    github.com/filecoin-project/lotus/node/impl/eth.(*ethTransaction).getCidForTransaction\n        /root/repos/lotus/node/impl/eth/transaction.go:415\n  - failed to read data from index:\n    github.com/filecoin-project/lotus/chain/index.(*SqliteIndexer).readWithHeadIndexWait\n        /root/repos/lotus/chain/index/read.go:88\n  - context canceled"}

But, now that tx is in perpetual pending mode:

$ grep 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 /var/log/curio/daemon.log | grep 'still pending' | wc -l
25742

But it's on chain: https://calibration.filfox.info/en/message/0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83?t=1

Lotus just doesn't know about it:

curl http://localhost:1234/rpc/v1 -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":0,"method":"eth_getTransactionByHash","params":["0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"]}'
{"id":0,"jsonrpc":"2.0","result":null}

But on my Calibnet node:

$ curl http://localhost:1235/rpc/v1 -X POST -H 'Content-Type: application/json' -d
 '{"jsonrpc":"2.0","id":0,"method":"eth_getTransactionByHash","params":["0x3005f4a91b26648cf54d94933d147798885aa38cce2dc6cdebb66f53693cc75d"]}' | jq .result.blockHash
"0xae6133f78a2d8d47814e46e9139807f2dca1e77ec66e309bd83513bf916d9136"

So this looks kind of like a ChainIndexer bug as well.

rvagg · 2025-07-02T06:31:38Z

Oh, actually, that tx is from 2025-06-21 17:16:00, which is why it's not in the ChainIndexer or in the blockstore, it's too old. So this is back to the original problem, hitting Lotus, getting that timeout error and then locking up. Our timeout in this PR is just kicking it into action again. This particular transaction is a re-processing of one that was missed previously on this Curio node while it was locked up and splitstore removed it before it could be observed and processed, now it's in an endless loop looking for it and it'll never find it unfortunately. And I don't have a calibnet node I can point it to that has deeper history.

So, a couple of problems here:

Something to do with the Eth transaction lookup; it could be ChainIndexer related or maybe it's something to do with go-jsonrpc, or something else, but go-ethereum/ethclient really doesn't like it and it locks up and we shouldn't be triggering that behaviour on Lotus.
Dealing with the lock-up here. Ideally ethclient wouldn't lock up and would be more robust, but it's not apparently so we have to deal with it. So this PR does that.

I'll observe this a little longer and then tidy it up for potential merge.

- Add mutex protection and notification timeout handling to CurioChainSched - Implement timeout protection for eth client calls to prevent indefinite blocking - Add comprehensive test coverage for chain scheduler and eth message watcher - Extract interfaces (EthClient, TaskEngine, EthTransactionManager) for testability

rvagg · 2025-07-08T12:44:27Z

#552 (comment)

This is now the same as #552 but for the synapse branch. This is a confirmed fix for the problems in #548. I'm going to redeploy this branch as is just to be absolutely sure I haven't messed anything up (it's been refactored a little, cleaned up and had tests added, so it's not quite the same as I've had deployed).

* ServiceAuth and NullAuth (#512) * ServiceAuth and NullAuth * fix --------- Co-authored-by: zenground0 <[email protected]> * Fix/public service sql mig (#516) * Add sql migration function to enable public service with dummy key * Fix typo --------- Co-authored-by: zenground0 <[email protected]> * Tigher busy waiting (#521) Co-authored-by: zenground0 <[email protected]> * fix (#523) Co-authored-by: zenground0 <[email protected]> * Look in pdp piecerefs to check for actual readiness (#525) * Look in pdp piecerefs to check for actual readiness * fix --------- Co-authored-by: zenground0 <[email protected]> * YOLO (#526) Co-authored-by: zenground0 <[email protected]> * Remove unpadding transformation from root size (#529) Co-authored-by: zenground0 <[email protected]> * feat(pdp): add transaction tracking for root additions (synapse branch) (#530) * feat(pdp): add transaction tracking for root additions - Add Location header to root addition response with tx hash - Add GET /proof-sets/{id}/roots/added/{txHash} endpoint for status checks - Implement handleGetProofSetRoot for retrieving root details - Return confirmed root IDs after transaction confirmation * fix(pdp): use lowercase transaction hashes consistently in database operations * fix(pdp): retain root addition records to prevent 404s on status queries Previously, pdp_proofset_root_adds records were deleted after processing, causing clients to receive 404 errors when querying transaction status after successful root additions. This created a poor UX as clients couldn't verify transaction outcomes if they queried too late. This change aligns root addition tracking with proof set creation by: - Adding roots_added column to mark processed records instead of deleting - Updating watcher to UPDATE instead of DELETE when processing completes - Including roots_added status in API response for client visibility The fix ensures transaction status remains queryable indefinitely while maintaining backward compatibility. Existing records will be processed normally with roots_added=FALSE by default. API enhancement: /pdp/proof-sets/{id}/roots/added/{txHash} now includes "rootsAdded" field indicating if roots were fully processed. * fix(pdp): extra checks to make sure the db entries are consistent * fix(pdp): address review feedback * chore(pdp): add more logging; don't bail early on tx error (#549) * chore(pdp): add transaction processing logging * chore(pdp): tweak logging statements * fix(chain): improve chain scheduler robustness and eth tx monitoring (#553) - Add mutex protection and notification timeout handling to CurioChainSched - Implement timeout protection for eth client calls to prevent indefinite blocking - Add comprehensive test coverage for chain scheduler and eth message watcher - Extract interfaces (EthClient, TaskEngine, EthTransactionManager) for testability * chore(logs): log proof data as hex (#551) --------- Co-authored-by: ZenGround0 <[email protected]> Co-authored-by: zenground0 <[email protected]> Co-authored-by: Rod Vagg <[email protected]>

- similiar fix to change in filecoin-project/curio#553

frrist · 2025-08-01T17:27:08Z

lib/chainsched/chain_sched.go

+// Notification timeout for chain updates, if we don't get a notification within this time frame
+// then something must be wrong so we'll attempt to restart
+const notificationTimeout = 60 * time.Second


In the event of null round(s) might this timeout be too aggressive?
e.g. with logging added:

2025-08-01T16:12:30.058Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891279"} 2025-08-01T16:12:30.059Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.001100011, "apply": "2891279"} 2025-08-01T16:13:36.247Z WARN scheduler/chain chainsched/chain.go:159 no notifications received in 1m0s, resubscribing to ChainNotify 2025-08-01T16:13:36.248Z INFO scheduler/chain chainsched/chain.go:94 restarting Scheduler with new notification channel 2025-08-01T16:13:36.249Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891279"} 2025-08-01T16:13:36.250Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.001087254, "apply": "2891279"} 2025-08-01T16:13:59.639Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891282"} 2025-08-01T16:13:59.640Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.000919186, "apply": "2891282"} 2025-08-01T16:14:04.084Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891282"} 2025-08-01T16:14:04.085Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.000956019, "apply": "2891282"}

In this case, blocks 2891280 and 2891281 were null rounds

yes, we probably should be more conservative here, I'll open a PR.

frrist · 2025-08-01T17:34:32Z

lib/chainsched/chain_sched.go

+			log.Debugf("CurioChainSched ticker: %s since last notification", since)
+			if since > notificationTimeout {
+				log.Warnf("no notifications received in %s, resubscribing to ChainNotify", notificationTimeout)
+				notificationCh = nil


In the event this case occurs, I believe the context passed to ChainNotify will need to be canceled, else the go routine in lotus will continue to send notifications to the subscription channel, eventually resulting in warnings from go-jsonrpc like:

2025-08-01T16:17:04.386Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 11}

Then, per my other comment above, when null round(s) occurs again, we'll create another subscription, resulting in more buffering in go-jsonrpc, e.g.:

2025-08-01T17:11:03.778Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 16} 2025-08-01T17:11:03.778Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 14} 2025-08-01T17:11:03.780Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 113} 2025-08-01T17:11:03.780Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 143} 2025-08-01T17:11:03.780Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891396"} 2025-08-01T17:11:03.781Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.000847668, "apply": "2891396"} 2025-08-01T17:12:36.247Z WARN scheduler/chain chainsched/chain.go:159 no notifications received in 1m0s, resubscribing to ChainNotify 2025-08-01T17:12:36.248Z INFO scheduler/chain chainsched/chain.go:94 restarting Scheduler with new notification channel 2025-08-01T17:12:36.249Z INFO scheduler/chain chainsched/chain.go:169 start chain scheduler update {"apply": "2891396"} 2025-08-01T17:12:36.250Z INFO scheduler/chain chainsched/chain.go:185 end chain scheduler update {"duration": 0.00077717, "apply": "2891396"} 2025-08-01T17:14:00.213Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 17} 2025-08-01T17:14:00.213Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 15} 2025-08-01T17:14:00.213Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 114} 2025-08-01T17:14:00.213Z WARN rpc [email protected]/client.go:469 rpc output message buffer {"n": 144}

frrist · 2025-08-01T17:36:34Z

lib/chainsched/chain_sched.go

-			notifs, err = s.api.ChainNotify(ctx)
+	for ctx.Err() == nil {
+		if notificationCh == nil {
+			notificationCh, err = s.api.ChainNotify(ctx)


I think we need to cancel this context, or use a dedicated one for cleanup of subscription channels.

rvagg · 2025-08-04T04:20:03Z

@frrist please review: #575

rvagg changed the title ~~Rvagg/eth client timeout and log~~ fix(eth): eth client timeout while watching Jul 1, 2025

rjan90 mentioned this pull request Jul 2, 2025

Investigate ETH RPC implementation incompatibility causing go-ethereum/ethclient lockup filecoin-project/lotus#13195

Open

rvagg mentioned this pull request Jul 8, 2025

fix(chain): improve chain scheduler robustness and eth tx monitoring #552

Merged

rvagg force-pushed the rvagg/eth-client-timeout-and-log branch 4 times, most recently from 430f43a to ee0ef4d Compare July 8, 2025 12:21

rvagg force-pushed the rvagg/eth-client-timeout-and-log branch from ee0ef4d to 9c8eafd Compare July 8, 2025 12:26

rvagg changed the title ~~fix(eth): eth client timeout while watching~~ fix(chain): improve chain scheduler robustness and eth tx monitoring (synapse) Jul 8, 2025

rvagg marked this pull request as ready for review July 8, 2025 12:27

rvagg mentioned this pull request Jul 8, 2025

PDP: addRoots intermittently fails - chain watcher stops processing transactions #548

Closed

rjan90 added this to the M2: Pandora Alpha on Mainnet milestone Jul 9, 2025

ZenGround0 approved these changes Jul 9, 2025

View reviewed changes

rjan90 merged commit 8a35e49 into filecoin-project:synapse Jul 10, 2025
15 checks passed

frrist added a commit to storacha/piri that referenced this pull request Jul 30, 2025

fix: add timeout to eth api requests to prevent infinite hanging

1cf0823

- similiar fix to change in filecoin-project/curio#553

frrist mentioned this pull request Jul 30, 2025

fix: add timeout to eth api requests to prevent infinite hanging storacha/piri#164

Merged

frrist added a commit to storacha/piri that referenced this pull request Jul 30, 2025

fix: add timeout to eth api requests to prevent infinite hanging

5908d8a

- similiar fix to change in filecoin-project/curio#553

frrist reviewed Aug 1, 2025

View reviewed changes

rvagg mentioned this pull request Aug 4, 2025

fix: extend CurioChainSched timeout, isolate ctx cancellation #575

Merged

rvagg deleted the rvagg/eth-client-timeout-and-log branch August 4, 2025 04:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(chain): improve chain scheduler robustness and eth tx monitoring (synapse) #553

fix(chain): improve chain scheduler robustness and eth tx monitoring (synapse) #553

Uh oh!

rvagg commented Jul 1, 2025 •

edited

Loading

Uh oh!

rvagg commented Jul 2, 2025

Uh oh!

rvagg commented Jul 2, 2025

Uh oh!

rvagg commented Jul 8, 2025

Uh oh!

Uh oh!

frrist Aug 1, 2025

Uh oh!

rvagg Aug 4, 2025

Uh oh!

frrist Aug 1, 2025

Uh oh!

frrist Aug 1, 2025

Uh oh!

rvagg commented Aug 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

fix(chain): improve chain scheduler robustness and eth tx monitoring (synapse) #553

fix(chain): improve chain scheduler robustness and eth tx monitoring (synapse) #553

Uh oh!

Conversation

rvagg commented Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rvagg commented Jul 2, 2025

Uh oh!

rvagg commented Jul 2, 2025

Uh oh!

rvagg commented Jul 8, 2025

Uh oh!

Uh oh!

frrist Aug 1, 2025

Choose a reason for hiding this comment

Uh oh!

rvagg Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

frrist Aug 1, 2025

Choose a reason for hiding this comment

Uh oh!

frrist Aug 1, 2025

Choose a reason for hiding this comment

Uh oh!

rvagg commented Aug 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

rvagg commented Jul 1, 2025 •

edited

Loading