Closed
Labels
area/markets (Area: Markets)
Description
Basic Information
- In go-fil-markets v1.3.0 and go-data-transfer v1.5.0, we introduced a "channel monitor" that times out storage and retrieval deals if the other peer doesn't send an `Accept` message before a stipulated timeout. That happens here.
- We've begun seeing storage deal errors where the client (Estuary) hits the timeout after proposing a storage deal to the miner.
- We’re also seeing similar errors in the dealbot retrieval deals.
- We even saw the same error for retrieval on a CI test failure but unfortunately the relevant parts of the code weren’t well logged and the DEBUG logs weren’t enabled in the test. We’ve tried reproducing the same error by running the failing test multiple times on the CI and locally but with no luck. It only occurs sporadically.
- We tried moving the retrieval bot to v1.9.0, which doesn't have the Markets v1.3.0 dependency, but even then it runs into the same problems. This is likely a regression on the miner caused by Markets v1.3.0, which contains some heavy refactoring in the data transfer module related to retrieval restarts.
Describe the problem
For Storage Deals
This describes an occurrence of the problem when @whyrusleeping ran a storage deal via Estuary (depending on go-data-transfer v1.5.0) against @magik6k's miner (depending on Markets v1.3.0 and data-transfer v1.5.0):
- After a storage client sends a data transfer message asking the provider to open a graphsync pull channel to the client to fetch the data, the client starts a timer to wait for the provider to send an `Accepted` response as part of the Graphsync request the provider will send to the client.
- The code where the provider sends this response is at: https://github.com/filecoin-project/go-data-transfer/blob/3a130c3f4d33e422b08a3175748e0c718156b6a5/impl/receiver.go#L57
- The code where the client receives and processes this "Accepted" response is at: https://github.com/filecoin-project/go-data-transfer/blob/3a130c3f4d33e422b08a3175748e0c718156b6a5/impl/events.go#L185
- The miner accepted the client's Push request and opened a GS Pull Request, but the client didn't receive the "Accept" message in time, so it cancelled the transfer and sent the channel cancellation to the provider. Unfortunately, DEBUG logs weren't enabled on the miner, so we can't get more details here. We can see the miner receiving the cancellation in these logs:
event {"name": "ProviderEventOpen", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealValidating", "message": ""}
2021-05-22T02:40:40.982+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDealDeciding", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealAcceptWait", "message": ""}
2021-05-22T02:40:40.994+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataRequested", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealWaitingForData", "message": ""}
2021-05-22T02:40:41.573+0200 INFO dt-impl impl/events.go:298 received new channel request from 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw
2021-05-22T02:40:41.585+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealTransferring", "message": ""}
2021-05-22T02:40:41.589+0200 INFO dt_graphsync graphsync/graphsync.go:189 Opening graphsync request to 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw for root QmRKPnQDpiBAv7NVroBKHrwbVDax9yyq4bPfC9DMKx24wK with 0 CIDs already received
2021-05-22T02:40:41.589+0200 INFO dt-impl impl/events.go:19 channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWDMpcct12Vb6jPXwjvLQHA2hoP8XKGbUZ2tpue1ydoZUm-1621617623978777806: opened
2021-05-22T02:41:06.045+0200 INFO miner miner/miner.go:462 Time delta between now and our mining base: 6s (nulls: 0)
2021-05-22T02:41:06.047+0200 INFO gen gen/gen.go:638 completed winAttemptVRF {"beaconRound": 873767, "beaconDataB64": "hftXggo3D73RHhDfYNlmyy5nwSsNF-zzgAxoU0yqbFJyWXOJ2_t9A0XDQna6IJd3D9GgkHVzwJwMUUPEnEAjWkY8cE80LrWlIovh45PYT7Y-AZf2yG9GeqVZcsBshsR7", "electionRandB64": "NCNg7FelNzrXJYYN4XnIAdxMUQWVvDioBoui2tF1t0o", "vrfB64": "mMqdxEU4_ABOVHo2-Abkv3fjBGpeL9Ef4r1DBljT0v4HqDD6ucdRxXzrDaLr8B37F2Pd2AKrGqTpYwzCz0hyTWvgWkeEh-GhgnWt8kjPymb9DQ5L_7t1fnFKN5kLzQSX", "winCount": 0}
2021-05-22T02:41:06.048+0200 INFO miner miner/miner.go:436 completed mineOne {"forRound": 777923, "baseEpoch": 777922, "lookbackEpochs": 900, "networkPowerAtLookback": "6371887650916007936", "minerPowerAtLookback": "46751934185472", "isEligible": true, "isWinner": false}
2021-05-22T02:41:11.575+0200 INFO dt-impl impl/events.go:137 channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWDMpcct12Vb6jPXwjvLQHA2hoP8XKGbUZ2tpue1ydoZUm-1621617623978777806: received cancel request, cleaning up channel
2021-05-22T02:41:11.575+0200 ERROR dt_graphsync graphsync/graphsync.go:731 failed to unregister persistence option data-transfer-12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWDMpcct12Vb6jPXwjvLQHA2hoP8XKGbUZ2tpue1ydoZUm-1621617623978777806: cannot unregister while requests are in progress
2021-05-22T02:41:11.576+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataTransferCancelled", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealFailing", "message": "data transfer cancelled"}
2021-05-22T02:41:11.590+0200 WARN providerstates providerstates/provider_states.go:536 deal bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq failed: data transfer cancelled
2021-05-22T02:41:11.684+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventFailed", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealError", "message": "data transfer cancelled"}
- The Estuary logs simply show the timeout:
transfer failed: 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWKG9mQfyAAyCc3huFVMMzpKCdbgLhED6PYFhwM7q8Ab3W-1621617623978777810: timed out waiting 30s for Accept message from remote peer
- We don't know if the Provider didn't send an `Accept` OR if the client failed to process it correctly after receiving it.
- The next step here is for us to add better logging to this part of the code on both the client and miner, turn on debug logging on both, and repeat deal making until we see this error again. There is a WIP PR for the logging at [WIP] Feat/debug accept message error go-data-transfer#210.
For Retrieval Deals
- Please see @mgoelzer's first report of the issue at [BUG] Retrieval Error: error generated by data transfer: deal data transfer failed #6299 (comment).
- The interesting thing here is that the client starts receiving blocks from the Provider but NOT the "Accept" message. This means that the Provider Market was able to access an unsealed copy of the Piece and send blocks across via the Graphsync protocol on the channel opened by the Miner, but the `Accept` response is either NOT sent by the Miner in the initial Graphsync response or is NOT processed correctly by the Client.
ubuntu@dealbot-mainnet:~$ lotus client retrieve --miner f0215497 mAXCg5AIgcwUbnEkjzoZgSVAbcO3KXldc/i+2Q3IoGYojEnq1w4s /dev/null
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
> Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)
> Recv: 52.01 KiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 1.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 2.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 3.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 4.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 5.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 6.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
- We thought increasing the timeout for receiving the `Accept` message would solve the issue, but we've seen instances where the `Accept` message simply does NOT arrive from the miner.
- We don't have access to the miner logs here as the Dealbot is running against miners in the wild. We've managed to reproduce it ONLY once against the Sofia miner, but the part of the code where we send/receive the accept for retrieval isn't well logged.
- I've managed to run successful retrievals against Piknik's miner using the same Lotus client version as the Dealbot, with the Piknik miner also running a recent tip of master. So this problem, too, occurs only sporadically.
- The code where the Retrieval Miner sends the `Accept` message as part of the Graphsync response to the client's Graphsync request is at https://github.com/filecoin-project/go-data-transfer/blob/3a130c3f4d33e422b08a3175748e0c718156b6a5/transport/graphsync/graphsync.go#L446.
- The code where the Retrieval client receives and processes the `Accept` message is at https://github.com/filecoin-project/go-data-transfer/blob/3a130c3f4d33e422b08a3175748e0c718156b6a5/impl/events.go#L185.
- Please note that moving the Dealbot client to v1.9.0, which does NOT depend on Markets v1.3.0, does NOT solve the issue. This suggests that the Miner is probably part of the problem and is running a Lotus version that depends on Markets v1.3.0.
- The next step here is to deploy the PR that logs this code flow on miners that we have access to and run the Dealbot against those miners, giving us access to the Debug logs for both the Dealbot and the Miner when we see this problem again.
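While waiting for better logging, the retrieval symptom (blocks arriving while the deal is still in `DealStatusWaitForAcceptance`) can be spotted mechanically in captured `lotus client retrieve` output. The Go sketch below is a hypothetical triage helper, not part of Lotus or the Dealbot; `blocksBeforeAccept` and `eventLine` are made-up names, and the regular expression assumes the line format shown in the output above.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// eventLine captures the event name and deal status at the end of lines such
// as "> Recv: 52.01 KiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)".
var eventLine = regexp.MustCompile(`(\w+) \((\w+)\)\s*$`)

// blocksBeforeAccept counts output lines that report received blocks while
// the deal is still waiting for acceptance.
func blocksBeforeAccept(output string) int {
	count := 0
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		m := eventLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue // not an event line
		}
		if m[1] == "ClientEventBlocksReceived" && m[2] == "DealStatusWaitForAcceptance" {
			count++
		}
	}
	return count
}

func main() {
	// A few lines copied from the retrieval output quoted in this issue.
	sample := `> Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)
> Recv: 52.01 KiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 1.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)`
	fmt.Println("block events seen before Accept:", blocksBeforeAccept(sample))
}
```

A non-zero count only flags runs worth investigating; it does not say whether the `Accept` was never sent or was dropped on the client side.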