
[Deal Making Issue] Storage AND Retrieval deals time out waiting for an Accept message  #6343

@aarshkshah1992

Description

Basic Information

  • In go-fil-markets v1.3.0 and go-data-transfer v1.5.0, we've introduced a "channel monitor" that times out storage and retrieval deals if the other peer doesn't send an Accept message within a stipulated timeout. That happens here. (A minimal sketch of the mechanism is included after this list.)
  • We’ve begun seeing storage deal errors where the client (Estuary) hits the timeout after proposing a storage deal to the miner.
  • We’re also seeing similar errors in the dealbot retrieval deals.
  • We even saw the same error for retrieval on a CI test failure but unfortunately the relevant parts of the code weren’t well logged and the DEBUG logs weren’t enabled in the test. We’ve tried reproducing the same error by running the failing test multiple times on the CI and locally but with no luck. It only occurs sporadically.
  • We tried moving the retrieval bot to v1.9.0, which doesn't have the Markets v1.3.0 dependency, but even then it runs into the same problems. This is likely a regression on the miner side caused by Markets v1.3.0, which contains some heavy refactoring in the data transfer module related to retrieval restarts.
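
As a point of reference, here is a minimal, hypothetical sketch of what such a channel monitor does (the type and function names below are illustrative only and do not match the real go-data-transfer code): it arms a timer when the transfer channel opens and fails the channel if no Accept event is observed before the timeout; other events such as blocks being received do not satisfy it.

package monitor

import (
	"fmt"
	"time"
)

// channelEvent is a stand-in for data-transfer channel events.
type channelEvent int

const (
	eventAccept channelEvent = iota
	eventDataReceived
)

// awaitAccept fails the channel (via the fail callback) if no Accept event
// arrives on the events stream within acceptTimeout.
func awaitAccept(events <-chan channelEvent, acceptTimeout time.Duration, fail func(error)) {
	timer := time.NewTimer(acceptTimeout)
	defer timer.Stop()
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return // channel closed, nothing left to monitor
			}
			if ev == eventAccept {
				return // Accept arrived in time
			}
			// Other events (e.g. blocks received) do NOT reset or satisfy
			// the timer; only an Accept does.
		case <-timer.C:
			fail(fmt.Errorf("timed out waiting %s for Accept message from remote peer", acceptTimeout))
			return
		}
	}
}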

Describe the problem

For Storage Deals

This describes an occurrence of this problem when @whyrusleeping ran a storage deal via Estuary (which depends on go-data-transfer v1.5.0) against @magik6k's miner (which depends on Markets v1.3.0 and data-transfer v1.5.0):

event	{"name": "ProviderEventOpen", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealValidating", "message": ""}
2021-05-22T02:40:40.982+0200	INFO	markets	loggers/loggers.go:20	storage provider event	{"name": "ProviderEventDealDeciding", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealAcceptWait", "message": ""}
2021-05-22T02:40:40.994+0200	INFO	markets	loggers/loggers.go:20	storage provider event	{"name": "ProviderEventDataRequested", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealWaitingForData", "message": ""}
2021-05-22T02:40:41.573+0200	INFO	dt-impl	impl/events.go:298	received new channel request from 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw
2021-05-22T02:40:41.585+0200	INFO	markets	loggers/loggers.go:20	storage provider event	{"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealTransferring", "message": ""}
2021-05-22T02:40:41.589+0200	INFO	dt_graphsync	graphsync/graphsync.go:189	Opening graphsync request to 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw for root QmRKPnQDpiBAv7NVroBKHrwbVDax9yyq4bPfC9DMKx24wK with 0 CIDs already received
2021-05-22T02:40:41.589+0200	INFO	dt-impl	impl/events.go:19	channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWDMpcct12Vb6jPXwjvLQHA2hoP8XKGbUZ2tpue1ydoZUm-1621617623978777806: opened
2021-05-22T02:41:06.045+0200	INFO	miner	miner/miner.go:462	Time delta between now and our mining base: 6s (nulls: 0)
2021-05-22T02:41:06.047+0200	INFO	gen	gen/gen.go:638	completed winAttemptVRF	{"beaconRound": 873767, "beaconDataB64": "hftXggo3D73RHhDfYNlmyy5nwSsNF-zzgAxoU0yqbFJyWXOJ2_t9A0XDQna6IJd3D9GgkHVzwJwMUUPEnEAjWkY8cE80LrWlIovh45PYT7Y-AZf2yG9GeqVZcsBshsR7", "electionRandB64": "NCNg7FelNzrXJYYN4XnIAdxMUQWVvDioBoui2tF1t0o", "vrfB64": "mMqdxEU4_ABOVHo2-Abkv3fjBGpeL9Ef4r1DBljT0v4HqDD6ucdRxXzrDaLr8B37F2Pd2AKrGqTpYwzCz0hyTWvgWkeEh-GhgnWt8kjPymb9DQ5L_7t1fnFKN5kLzQSX", "winCount": 0}
2021-05-22T02:41:06.048+0200	INFO	miner	miner/miner.go:436	completed mineOne	{"forRound": 777923, "baseEpoch": 777922, "lookbackEpochs": 900, "networkPowerAtLookback": "6371887650916007936", "minerPowerAtLookback": "46751934185472", "isEligible": true, "isWinner": false}
2021-05-22T02:41:11.575+0200	INFO	dt-impl	impl/events.go:137	channel 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWDMpcct12Vb6jPXwjvLQHA2hoP8XKGbUZ2tpue1ydoZUm-1621617623978777806: received cancel request, cleaning up channel
2021-05-22T02:41:11.575+0200	ERROR	dt_graphsync	graphsync/graphsync.go:731	failed to unregister persistence option data-transfer-12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWDMpcct12Vb6jPXwjvLQHA2hoP8XKGbUZ2tpue1ydoZUm-1621617623978777806: cannot unregister while requests are in progress
2021-05-22T02:41:11.576+0200	INFO	markets	loggers/loggers.go:20	storage provider event	{"name": "ProviderEventDataTransferCancelled", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealFailing", "message": "data transfer cancelled"}
2021-05-22T02:41:11.590+0200	WARN	providerstates	providerstates/provider_states.go:536	deal bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq failed: data transfer cancelled
2021-05-22T02:41:11.684+0200	INFO	markets	loggers/loggers.go:20	storage provider event	{"name": "ProviderEventFailed", "proposal CID": "bafyreidjpjyepdsvkdpttu2n5z33xfufbnolmrszw5saa4ykqlakrzletq", "state": "StorageDealError", "message": "data transfer cancelled"}
  • The Estuary logs simply show the timeout:
transfer failed: 12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw-12D3KooWKG9mQfyAAyCc3huFVMMzpKCdbgLhED6PYFhwM7q8Ab3W-1621617623978777810: timed out waiting 30s for Accept message from remote peer
  • We don't know whether the Provider never sent an Accept or whether the client received it but failed to process it correctly.
  • The next step here is for us to add better logging to this part of the code on both the client and the miner, turn on debug logging on both sides (see the snippet after this list), and repeat deal making until we see this error again. There is a WIP PR for the logging at [WIP] Feat/debug accept message error go-data-transfer#210.
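
Until that PR lands, one stopgap is to raise the log level of the subsystems that appear in the excerpt above. A hedged sketch, assuming the go-log v2 API (github.com/ipfs/go-log/v2) that these components use for logging; the subsystem names are taken directly from the logs above:

package debuglog

import (
	logging "github.com/ipfs/go-log/v2"
)

// enableDealDebugLogs bumps the deal/data-transfer related subsystems to
// debug so that the Accept send/receive path shows up in the logs.
func enableDealDebugLogs() error {
	for _, system := range []string{"markets", "dt-impl", "dt_graphsync"} {
		if err := logging.SetLogLevel(system, "debug"); err != nil {
			return err
		}
	}
	return nil
}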

For Retrieval Deals

  • Please see @mgoelzer's first report of the issue at [BUG] Retrieval Error: error generated by data transfer: deal data transfer failed #6299 (comment).
  • The interesting thing here is that the client starts receiving blocks from the Provider but NOT the "Accept" message. This means the Provider's Markets process was able to access an unsealed copy of the Piece and send blocks across via the Graphsync protocol on the channel opened by the Miner, but the Accept response is either NOT sent by the Miner in the initial Graphsync response or NOT processed correctly by the Client. (A rough sketch of these two failure points follows this list.)
ubuntu@dealbot-mainnet:~$ lotus client retrieve --miner f0215497 mAXCg5AIgcwUbnEkjzoZgSVAbcO3KXldc/i+2Q3IoGYojEnq1w4s /dev/null
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
> Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)
> Recv: 52.01 KiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 1.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 2.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 3.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 4.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 5.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
> Recv: 6.051 MiB, Paid 0 FIL, ClientEventBlocksReceived (DealStatusWaitForAcceptance)
  • We thought increasing the timeout for receiving the Accept message would solve the issue but we've seen instances where the Accept message simply does NOT arrive from the miner.
  • We don't have access to the miner logs here as the Dealbot is running against miners in the wild. We've managed to reproduce it ONLY once against the Sofia miner but the part of the code where we send/receive the accept for retrieval isn't well logged.
  • I've managed to run successful retrievals against Piknik's miner using the same Lotus client version as the Dealbot, with the Piknik miner also running a recent tip of master. This problem, too, occurs only sporadically.
  • The code where the Retrieval Miner sends the Accept message as part of the Graphsync response to the client's Graphsync request is at https://github.com/filecoin-project/go-data-transfer/blob/3a130c3f4d33e422b08a3175748e0c718156b6a5/transport/graphsync/graphsync.go#L446.
  • The code where the Retrieval client receives and processes the Accept message is at https://github.com/filecoin-project/go-data-transfer/blob/3a130c3f4d33e422b08a3175748e0c718156b6a5/impl/events.go#L185.
  • Please note that moving the Dealbot client to v1.9.0, which does NOT depend on Markets v1.3.0, does NOT solve the issue, suggesting that the Miner, which runs a Lotus version that depends on Markets v1.3.0, is probably part of the problem.
  • The next step here is to deploy the PR that logs this code flow on miners we have access to and run the Dealbot against those miners, so that we have debug logs for both the Dealbot and the Miner the next time we see this problem.
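
To make the two suspected failure points concrete, here is a rough, hypothetical sketch of the hand-off (the names and the extension identifier are illustrative only; the real implementation lives at the go-data-transfer links above): the provider attaches the Accept response to its Graphsync response as an extension, and the client decodes that extension and fires an Accept event into the deal state machine. A failure on either side produces exactly the symptom in the transcript above: blocks keep arriving while the deal remains in DealStatusWaitForAcceptance.

package acceptflow

import "errors"

// responseMessage is a stand-in for the data-transfer response that gets
// piggybacked on the Graphsync response.
type responseMessage struct {
	Accepted   bool
	TransferID uint64
}

// graphsyncResponse is a stand-in for the Graphsync response metadata:
// extension name -> serialized payload.
type graphsyncResponse struct {
	Extensions map[string][]byte
}

const dtExtensionName = "fil/data-transfer" // illustrative extension name

// Provider side (failure point 1): attach the Accept to the outgoing
// Graphsync response. If this step is skipped or fails silently, the client
// still receives blocks but never sees an Accept.
func attachAccept(resp *graphsyncResponse, encode func(responseMessage) ([]byte, error), transferID uint64) error {
	data, err := encode(responseMessage{Accepted: true, TransferID: transferID})
	if err != nil {
		return err
	}
	if resp.Extensions == nil {
		resp.Extensions = make(map[string][]byte)
	}
	resp.Extensions[dtExtensionName] = data
	return nil
}

// Client side (failure point 2): decode the extension and dispatch an Accept
// event to the deal state machine. If decoding or dispatch fails, the deal
// stays in DealStatusWaitForAcceptance even as blocks arrive.
func handleResponse(resp graphsyncResponse, decode func([]byte) (responseMessage, error), onAccept func(transferID uint64)) error {
	data, ok := resp.Extensions[dtExtensionName]
	if !ok {
		return errors.New("no data-transfer extension on Graphsync response")
	}
	msg, err := decode(data)
	if err != nil {
		return err
	}
	if msg.Accepted {
		onAccept(msg.TransferID)
	}
	return nil
}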
