Skip to content

Conversation

@rvagg
Copy link
Member

@rvagg rvagg commented Jul 1, 2025

  • Add mutex protection and notification timeout handling to CurioChainSched
  • Implement timeout protection for eth client calls to prevent indefinite blocking
  • Add comprehensive test coverage for chain scheduler and eth message watcher
  • Extract interfaces (EthClient, TaskEngine, EthTransactionManager) for testability

(Edited after confirming this works, cleaning up and adding tests)


Original:

Still failing in watch_eth.go, but I have a little bit more information with additional logging turned on.

Last output for watch_eth.go working on stuff is:

2025-06-30T13:11:00.927Z        INFO    curio/message   message/watch_eth.go:104        Processing pending transactions {"count": 12, "machineID": 1}
2025-06-30T13:11:00.927Z        DEBUG   curio/message   message/watch_eth.go:112        Checking transaction    {"txHash": "0x0ee2d4acfb636e314a0a544a84e60873aebaa39c607f500e5ac0b4d009953f6e"}
2025-06-30T13:11:00.928Z        DEBUG   curio/message   message/watch_eth.go:119        Transaction still pending       {"txHash": "0x0ee2d4acfb636e314a0a544a84e60873aebaa39c607f500e5ac0b4d009953f6e"}
2025-06-30T13:11:00.928Z        DEBUG   curio/message   message/watch_eth.go:112        Checking transaction    {"txHash": "0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4"}

After that we just have this:

2025-07-01T01:54:30.072Z        DEBUG   curio/message   message/watch_eth.go:246        Head change received    {"newHeight": "2800283"}
2025-07-01T01:54:30.072Z        DEBUG   curio/message   message/watch_eth.go:251        Update channel full, skipping   {"height": "2800283"}
2025-07-01T01:55:00.080Z        DEBUG   curio/message   message/watch_eth.go:246        Head change received    {"newHeight": "2800284"}
2025-07-01T01:55:00.080Z        DEBUG   curio/message   message/watch_eth.go:251        Update channel full, skipping   {"height": "2800284"}

Which looks to me like a lock-up here since either return path should log something:

receipt, err := mw.api.TransactionReceipt(ctx, txHash)

At the same time, Lotus shows us:

{"level":"error","ts":"2025-06-30T13:11:00.929Z","logger":"node/eth","caller":"eth/transaction.go:414","msg":"failed to lookup transaction hash 0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4 in chain indexer: failed to read data from index: context canceled"}
{"level":"warn","ts":"2025-06-30T13:11:00.929Z","logger":"rpc","caller":"[email protected]/handler.go:421","msg":"error in RPC call to 'eth_getTransactionReceipt': failed to lookup transaction hash 0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4 in chain indexer:\n    github.com/filecoin-project/lotus/node/impl/eth.(*ethTransaction).getCidForTransaction\n        /root/repos/lotus/node/impl/eth/transaction.go:415\n  - failed to read data from index:\n    github.com/filecoin-project/lotus/chain/index.(*SqliteIndexer).readWithHeadIndexWait\n        /root/repos/lotus/chain/index/read.go:88\n  - context canceled"}
{"level":"error","ts":"2025-06-30T13:11:00.929Z","logger":"rpc","caller":"[email protected]/websocket.go:145","msg":"handle me:write tcp4 127.0.0.1:1234->127.0.0.1:37802: use of closed network connection"}

That path suggests that we hit the moment in the Lotus ChainIndexer where it had a new chain head but it wasn't quite indexed, so it had to wait a little while to see it be indexed, which should be pretty quick but there's a 5 second timeout on the wait. That particular timeout isn't what's returned here though, it's a timeout from somewhere up the stack, I believe go-jsonrpc, maybe a deadline in the websocket handler, although I can't figure out exactly where. I think it might be the generic timeout that go-jsonrpc sets up for listening for responses which is 30 seconds, but that seems far too long to be relevant here. The timing says Curio asked Lotus at 13:11:00.928 for an eth_getTransactionReceipt and Lotus experienced a timeout error at 13:11:00.929!

The fact that we don't progress further than this might suggest that go-ethereum/ethclient doesn't know how to deal with the error that comes out of Lotus for this event, or maybe one of the websocket handlers (on either side) doesn't know. So my strategy here is to put a 1s timeout per transaction in the processing loop and move on if we don't get a response after that. Hopefully whatever part of the stack is pausing will respond to a context kick.

@rvagg rvagg changed the title Rvagg/eth client timeout and log fix(eth): eth client timeout while watching Jul 1, 2025
@rvagg
Copy link
Member Author

rvagg commented Jul 2, 2025

Caught one with this PR:

2025-07-01T12:03:00.264Z        DEBUG   curio/message   message/watch_eth.go:119        Checking transaction    {"txHash": "0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"}
2025-07-01T12:03:01.264Z        DEBUG   curio/message   message/watch_eth.go:134        Eth calls timed out - continuing with next tx   {"txHash": "0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"}

And in Lotus:

{"level":"error","ts":"2025-07-01T12:03:00.264Z","logger":"node/eth","caller":"eth/transaction.go:414","msg":"failed to lookup transaction hash 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 in chain indexer: failed to read data from index: context canceled"}
{"level":"warn","ts":"2025-07-01T12:03:00.264Z","logger":"rpc","caller":"[email protected]/handler.go:421","msg":"error in RPC call to 'eth_getTransactionReceipt': failed to lookup transaction hash 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 in chain indexer:\n    github.com/filecoin-project/lotus/node/impl/eth.(*ethTransaction).getCidForTransaction\n        /root/repos/lotus/node/impl/eth/transaction.go:415\n  - failed to read data from index:\n    github.com/filecoin-project/lotus/chain/index.(*SqliteIndexer).readWithHeadIndexWait\n        /root/repos/lotus/chain/index/read.go:88\n  - context canceled"}

But, now that tx is in perpetual pending mode:

$ grep 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 /var/log/curio/daemon.log | grep 'still pending' | wc -l
25742

But it's on chain: https://calibration.filfox.info/en/message/0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83?t=1

Lotus just doesn't know about it:

curl http://localhost:1234/rpc/v1 -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":0,"method":"eth_getTransactionByHash","params":["0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"]}'
{"id":0,"jsonrpc":"2.0","result":null}

But on my Calibnet node:

$ curl http://localhost:1235/rpc/v1 -X POST -H 'Content-Type: application/json' -d
 '{"jsonrpc":"2.0","id":0,"method":"eth_getTransactionByHash","params":["0x3005f4a91b26648cf54d94933d147798885aa38cce2dc6cdebb66f53693cc75d"]}' | jq .result.blockHash
"0xae6133f78a2d8d47814e46e9139807f2dca1e77ec66e309bd83513bf916d9136"

So this looks kind of like a ChainIndexer bug as well.

@rvagg
Copy link
Member Author

rvagg commented Jul 2, 2025

Oh, actually, that tx is from 2025-06-21 17:16:00, which is why it's not in the ChainIndexer or in the blockstore, it's too old. So this is back to the original problem, hitting Lotus, getting that timeout error and then locking up. Our timeout in this PR is just kicking it into action again. This particular transaction is a re-processing of one that was missed previously on this Curio node while it was locked up and splitstore removed it before it could be observed and processed, now it's in an endless loop looking for it and it'll never find it unfortunately. And I don't have a calibnet node I can point it to that has deeper history.

So, a couple of problems here:

  1. Something to do with the Eth transaction lookup; it could be ChainIndexer related or maybe it's something to do with go-jsonrpc, or something else, but go-ethereum/ethclient really doesn't like it and it locks up and we shouldn't be triggering that behaviour on Lotus.
  2. Dealing with the lock-up here. Ideally ethclient wouldn't lock up and would be more robust, but it's not apparently so we have to deal with it. So this PR does that.

I'll observe this a little longer and then tidy it up for potential merge.

@rvagg rvagg force-pushed the rvagg/eth-client-timeout-and-log branch 4 times, most recently from 430f43a to ee0ef4d Compare July 8, 2025 12:21
- Add mutex protection and notification timeout handling to CurioChainSched
- Implement timeout protection for eth client calls to prevent indefinite blocking
- Add comprehensive test coverage for chain scheduler and eth message watcher
- Extract interfaces (EthClient, TaskEngine, EthTransactionManager) for testability
@rvagg rvagg force-pushed the rvagg/eth-client-timeout-and-log branch from ee0ef4d to 9c8eafd Compare July 8, 2025 12:26
@rvagg rvagg changed the title fix(eth): eth client timeout while watching fix(chain): improve chain scheduler robustness and eth tx monitoring (synapse) Jul 8, 2025
@rvagg rvagg marked this pull request as ready for review July 8, 2025 12:27
@rvagg
Copy link
Member Author

rvagg commented Jul 8, 2025

#552 (comment)

This is now the same as #552 but for the synapse branch. This is a confirmed fix for the problems in #548. I'm going to redeploy this branch as is just to be absolutely sure I haven't messed anything up (it's been refactored a little, cleaned up and had tests added, so it's not quite the same as I've had deployed).

@rjan90 rjan90 merged commit 8a35e49 into filecoin-project:synapse Jul 10, 2025
15 checks passed
LexLuthr pushed a commit that referenced this pull request Jul 22, 2025
* ServiceAuth and NullAuth (#512)

* ServiceAuth and NullAuth

* fix

---------

Co-authored-by: zenground0 <[email protected]>

* Fix/public service sql mig (#516)

* Add sql migration function to enable public service with dummy key

* Fix typo

---------

Co-authored-by: zenground0 <[email protected]>

* Tigher busy waiting (#521)

Co-authored-by: zenground0 <[email protected]>

* fix (#523)

Co-authored-by: zenground0 <[email protected]>

* Look in pdp piecerefs to check for actual readiness (#525)

* Look in pdp piecerefs to check for actual readiness

* fix

---------

Co-authored-by: zenground0 <[email protected]>

* YOLO (#526)

Co-authored-by: zenground0 <[email protected]>

* Remove unpadding transformation from root size (#529)

Co-authored-by: zenground0 <[email protected]>

* feat(pdp): add transaction tracking for root additions (synapse branch) (#530)

* feat(pdp): add transaction tracking for root additions

- Add Location header to root addition response with tx hash
- Add GET /proof-sets/{id}/roots/added/{txHash} endpoint for status checks
- Implement handleGetProofSetRoot for retrieving root details
- Return confirmed root IDs after transaction confirmation

* fix(pdp): use lowercase transaction hashes consistently in database operations

* fix(pdp): retain root addition records to prevent 404s on status queries

Previously, pdp_proofset_root_adds records were deleted after processing,
causing clients to receive 404 errors when querying transaction status
after successful root additions. This created a poor UX as clients couldn't
verify transaction outcomes if they queried too late.

This change aligns root addition tracking with proof set creation by:
- Adding roots_added column to mark processed records instead of deleting
- Updating watcher to UPDATE instead of DELETE when processing completes
- Including roots_added status in API response for client visibility

The fix ensures transaction status remains queryable indefinitely while
maintaining backward compatibility. Existing records will be processed
normally with roots_added=FALSE by default.

API enhancement: /pdp/proof-sets/{id}/roots/added/{txHash} now includes
"rootsAdded" field indicating if roots were fully processed.

* fix(pdp): extra checks to make sure the db entries are consistent

* fix(pdp): address review feedback

* chore(pdp): add more logging; don't bail early on tx error (#549)

* chore(pdp): add transaction processing logging

* chore(pdp): tweak logging statements

* fix(chain): improve chain scheduler robustness and eth tx monitoring (#553)

- Add mutex protection and notification timeout handling to CurioChainSched
- Implement timeout protection for eth client calls to prevent indefinite blocking
- Add comprehensive test coverage for chain scheduler and eth message watcher
- Extract interfaces (EthClient, TaskEngine, EthTransactionManager) for testability

* chore(logs): log proof data as hex (#551)

---------

Co-authored-by: ZenGround0 <[email protected]>
Co-authored-by: zenground0 <[email protected]>
Co-authored-by: Rod Vagg <[email protected]>
frrist added a commit to storacha/piri that referenced this pull request Jul 30, 2025
frrist added a commit to storacha/piri that referenced this pull request Jul 30, 2025
Comment on lines +22 to +24
// Notification timeout for chain updates, if we don't get a notification within this time frame
// then something must be wrong so we'll attempt to restart
const notificationTimeout = 60 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the event of null round(s) might this timeout be too aggressive?
e.g. with logging added:

2025-08-01T16:12:30.058Z        INFO        scheduler/chain        chainsched/chain.go:169        start chain scheduler update        {"apply": "2891279"}
2025-08-01T16:12:30.059Z        INFO        scheduler/chain        chainsched/chain.go:185        end chain scheduler update        {"duration": 0.001100011, "apply": "2891279"}

2025-08-01T16:13:36.247Z        WARN        scheduler/chain        chainsched/chain.go:159        no notifications received in 1m0s, resubscribing to ChainNotify
2025-08-01T16:13:36.248Z        INFO        scheduler/chain        chainsched/chain.go:94        restarting Scheduler with new notification channel

2025-08-01T16:13:36.249Z        INFO        scheduler/chain        chainsched/chain.go:169        start chain scheduler update        {"apply": "2891279"}
2025-08-01T16:13:36.250Z        INFO        scheduler/chain        chainsched/chain.go:185        end chain scheduler update        {"duration": 0.001087254, "apply": "2891279"}
2025-08-01T16:13:59.639Z        INFO        scheduler/chain        chainsched/chain.go:169        start chain scheduler update        {"apply": "2891282"}
2025-08-01T16:13:59.640Z        INFO        scheduler/chain        chainsched/chain.go:185        end chain scheduler update        {"duration": 0.000919186, "apply": "2891282"}
2025-08-01T16:14:04.084Z        INFO        scheduler/chain        chainsched/chain.go:169        start chain scheduler update        {"apply": "2891282"}
2025-08-01T16:14:04.085Z        INFO        scheduler/chain        chainsched/chain.go:185        end chain scheduler update        {"duration": 0.000956019, "apply": "2891282"}

In this case, blocks 2891280 and 2891281 were null rounds

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we probably should be more conservative here, I'll open a PR.

Comment on lines +150 to +153
log.Debugf("CurioChainSched ticker: %s since last notification", since)
if since > notificationTimeout {
log.Warnf("no notifications received in %s, resubscribing to ChainNotify", notificationTimeout)
notificationCh = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the event this case occurs, I believe the context passed to ChainNotify will need to be canceled, else the go routine in lotus will continue to send notifications to the subscription channel, eventually resulting in warnings from go-jsonrpc like:

2025-08-01T16:17:04.386Z        WARN        rpc        [email protected]/client.go:469        rpc output message buffer        {"n": 11}

Then, per my other comment above, when null round(s) occurs again, we'll create another subscription, resulting in more buffering in go-jsonrpc, e.g.:

2025-08-01T17:11:03.778Z        WARN        rpc        [email protected]/client.go:469        rpc output message buffer        {"n": 16}
2025-08-01T17:11:03.778Z        WARN        rpc        [email protected]/client.go:469        rpc output message buffer        {"n": 14}
2025-08-01T17:11:03.780Z        WARN        rpc        [email protected]/client.go:469        rpc output message buffer        {"n": 113}
2025-08-01T17:11:03.780Z        WARN        rpc        [email protected]/client.go:469        rpc output message buffer        {"n": 143}
2025-08-01T17:11:03.780Z        INFO        scheduler/chain        chainsched/chain.go:169        start chain scheduler update        {"apply": "2891396"}
2025-08-01T17:11:03.781Z        INFO        scheduler/chain        chainsched/chain.go:185        end chain scheduler update        {"duration": 0.000847668, "apply": "2891396"}
2025-08-01T17:12:36.247Z        WARN        scheduler/chain        chainsched/chain.go:159        no notifications received in 1m0s, resubscribing to ChainNotify
2025-08-01T17:12:36.248Z        INFO        scheduler/chain        chainsched/chain.go:94        restarting Scheduler with new notification channel
2025-08-01T17:12:36.249Z        INFO        scheduler/chain        chainsched/chain.go:169        start chain scheduler update        {"apply": "2891396"}
2025-08-01T17:12:36.250Z        INFO        scheduler/chain        chainsched/chain.go:185        end chain scheduler update        {"duration": 0.00077717, "apply": "2891396"}
2025-08-01T17:14:00.213Z        WARN        rpc        [email protected]/client.go:469        rpc output message buffer        {"n": 17}
2025-08-01T17:14:00.213Z        WARN        rpc        [email protected]/client.go:469        rpc output message buffer        {"n": 15}
2025-08-01T17:14:00.213Z        WARN        rpc        [email protected]/client.go:469        rpc output message buffer        {"n": 114}
2025-08-01T17:14:00.213Z        WARN        rpc        [email protected]/client.go:469        rpc output message buffer        {"n": 144}

notifs, err = s.api.ChainNotify(ctx)
for ctx.Err() == nil {
if notificationCh == nil {
notificationCh, err = s.api.ChainNotify(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to cancel this context, or use a dedicated one for cleanup of subscription channels.

@rvagg
Copy link
Member Author

rvagg commented Aug 4, 2025

@frrist please review: #575

@rvagg rvagg deleted the rvagg/eth-client-timeout-and-log branch August 4, 2025 04:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants