diff --git a/proposals/images/bot-arch.png b/proposals/images/bot-arch.png
new file mode 100644
index 00000000..37566623
Binary files /dev/null and b/proposals/images/bot-arch.png differ
diff --git a/proposals/storage-and-retrieval-bots.md b/proposals/storage-and-retrieval-bots.md
new file mode 100644
index 00000000..de4851a1
--- /dev/null
+++ b/proposals/storage-and-retrieval-bots.md
@@ -0,0 +1,170 @@
+# Storage and Retrieval Dealbots
+
+Authors: @mgoelzer
+
+Initial PR: #84
+
+
+
+
+
+## Purpose & impact
+#### Background & intent
+In some cases, storage and retrieval deals on Filecoin mainnet fail. We do not currently have a good handle on how often this happens, what the causes are, whether it is specific to certain miners, whether miners refuse deals intentionally or because of software bugs, etc.
+
+The Dealbots proposed here aim to address these problems by randomly selecting miners and making deals with them. For instance, the pair of bots can make a storage deal and later attempt to retrieve that same data to understand end-to-end reliability on mainnet.
+
+The Retrieval Bot (r-bot) can also consume a list of {CID,miner} tuples and attempt retrieval on each one.
+
+In all cases, we log the success or failure of each storage or retrieval attempt, along with diagnostic information such as where in the sequence the failure occurred, what the error message was, and what the Lotus log tail contained.
+
+#### Assumptions & hypotheses
+_What must be true for this project to matter?_
+
+ - Some storage and retrieval deals fail on mainnet
+ - This is happening for multiple reasons: code bugs that prevent storage or retrieval from running to successful completion, and miners intentionally refusing certain types of deals (certain sizes, or an asymmetry between servicing storage vs retrieval deals).
+ - Understanding the different types of failure and their frequencies will help us find bugs in Lotus.
+ - Understanding the same will help us understand if miner economic incentives are suboptimal.
+ - Providing a tool that can aggregate data across many miners will provide a foundation for third parties to run miner reputation systems
+
+#### User workflow example
+
+```
+$ ./dealbot --input path/to/deals/to/try.json
+{
+  "status":"failure",
+  "failedAt":"ClientEventProviderCancelled", // failure event
+  "eventList":
+  [
+    "Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)",
+    "Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)",
+    "Recv: 0 B, Paid 0 FIL, ClientEventDealAccepted (DealStatusAccepted)",
+    "Recv: 0 B, Paid 0 FIL, ClientEventPaymentChannelAddingFunds (DealStatusPaymentChannelAllocatingLane)",
+    "Recv: 0 B, Paid 0 FIL, ClientEventLaneAllocated (DealStatusOngoing)",
+    "Recv: 0 B, Paid 0 FIL, ClientEventProviderCancelled (DealStatusCancelling)",
+    "Recv: 0 B, Paid 0 FIL, ClientEventDataTransferError (DealStatusErrored)",
+    "Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)"
+  ],
+  "errorMessage":"ERROR: retrieval failed: Retrieve: Retrieval Error: error generated by data transfer: unable to send cancel to channel FSM: normal shutdown of state machine",
+  "tailLog":"....", // Multiline, from `tail` of `lotus daemon`
+  "storageDealParameters": // Given to RetrievalBot as input
+  {
+    "CID":"Qm...",
+    "sha256":"73cb385...", // independent checksum of data file
+    "sizeInBytes":"12345678",
+    "minerId":"f01924",
+    "verified":true,
+    "fastRetrievalFlag":true,
+    "dealId":"..."
+  },
+  "lotusVersion":"1.5.3-rc2+mainnet+git.9afb5ff94",
+  // Call API `Filecoin.Version`
+  "datetime":"YYYY-MM-DD_HH:MM:SS" // when attempt started
+},
+{
+  // ...next deal attempt json blob...
+}
+```
+
+Stdout will contain the results, in JSON, of each deal attempt. It is intended to be piped into a log search service like those provided by AWS or Google Cloud.
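
As a rough illustration of how a downstream consumer might aggregate this output, the sketch below counts attempts per miner and failures per (miner, failure event) pair. It is not part of the proposal and rests on several assumptions: each deal attempt arrives as one comment-free JSON object per line, a successful attempt carries `"status":"success"`, and the miner identifier is stored under `storageDealParameters.minerId`. The function name `summarize` and the `SAMPLE` records are hypothetical.

```python
import json
from collections import Counter


def summarize(lines):
    """Count attempts per miner and failures per (miner, failedAt) pair.

    Assumes one JSON object per line; the real dealbot output framing
    may differ.
    """
    attempts = Counter()
    failures = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # "minerId" as the key name is an assumption about the schema.
        miner = record.get("storageDealParameters", {}).get("minerId", "unknown")
        attempts[miner] += 1
        if record.get("status") == "failure":
            failures[(miner, record.get("failedAt", "unknown"))] += 1
    return attempts, failures


# Hypothetical records shaped like the example output above.
SAMPLE = [
    '{"status": "failure", "failedAt": "ClientEventProviderCancelled", '
    '"storageDealParameters": {"minerId": "f01924"}}',
    '{"status": "success", "storageDealParameters": {"minerId": "f01924"}}',
]

if __name__ == "__main__":
    attempts, failures = summarize(SAMPLE)
    for (miner, event), count in failures.most_common():
        print(f"{miner} {event}: {count} of {attempts[miner]} attempts")
```

Keying failures by both miner and failure event keeps the two questions raised in the background section (which miners fail, and where in the sequence) answerable from the same tally.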
+
+
+#### Impact
+_How would this directly contribute to web3 dev stack product-market fit?_
+
+ - Improve reliability of the network
+ - Enable an ecosystem of miner reputation and ranking systems
+ - Perform the retrieval verification in Slingshot 2.3
+
+#### Leverage
+_How much would nailing this project improve our knowledge and ability to execute future projects?_
+
+**Immensely!**
+
+ - We don't currently have enough information about why deals fail to allocate our debugging time and resources correctly.
+
+ - Miner reputation systems enabled by this tool would complement the protocol-level incentives for miners to "do the right thing" (provide reliable retrieval of previously stored data, successfully complete all storage deals, etc.)
+
+#### Confidence
+_How sure are we that this impact would be realized? Label from [this scale](https://medium.com/@nimay/inside-product-introduction-to-feature-priority-using-ice-impact-confidence-ease-and-gist-5180434e5b15)_.
+
+C = 8
+
+Nothing is certain, but it is very likely that building this tool will at a minimum enable the Filecoin Project to better understand the frequency and causes of deal failures.
+
+The tool's ability to support miner reputation systems should also help route more deals to reliable miners.
+
+
+## Project definition
+#### Brief plan of attack
+
+
+ - **Phase 1: Retrieval Bot.** Reads stdin describing a CID to attempt to retrieve, writes outcome of retrieval attempt to stdout.
+ - **Phase 2: Storage Bot.** Same idea but for storage deals.
+
+In a subsequent project (out of scope for this PR; see [#87](https://github.com/protocol/web3-dev-team/pull/87)), we will create a set of bot orchestrators that invoke the r-bot and s-bot programs with inputs suitable for different use cases. For example, an orchestrator could feed the r-bot a queue of CIDs to test-retrieve in order to verify the Slingshot 2.3 competition.
+
+#### What does done look like?
+_What specific deliverables should be completed to consider this project done?_
+
+![High level architecture](https://github.com/protocol/web3-dev-team/blob/bots-proposal/proposals/images/bot-arch.png)
+
+#### What does success look like?
+_Success means impact. How will we know we did the right thing?_
+
+ - We have a metrics dashboard (in Observable, Grafana, etc.) that continuously shows the most recent deal failures, how frequently they are happening, which miners fail most/least, and similar metrics. The impact of this should be obvious: a clearer understanding of why and how often deals are failing on mainnet.
+ - Reputation systems emerge from ecosystem partners that use the data generated by running these bots to rank miners. This would give Filecoin users reliable, real-time miner ranking, which does not currently exist in the ecosystem.
+
+#### Counterpoints & pre-mortem
+_Why might this project be lower impact than expected? How could this project fail to complete, or fail to be successful?_
+
+ - The metrics fail to give us actionable debugging ideas
+ - Reputation systems develop their own code to capture the same miner statistics (duplication of effort)
+
+#### Alternatives
+_How might this project’s intent be realized in other ways (other than this project proposal)? What other potential solutions can address the same need?_
+
+ - [@whyrusleeping](https://github.com/whyrusleeping/)'s [Estuary](https://github.com/whyrusleeping/estuary) tool
+
+#### Dependencies/prerequisites
+
+
+ - [filecoin-project/lotus/pull/5833/](https://github.com/filecoin-project/lotus/pull/5833/)
+
+#### Future opportunities
+
+
+ - Miner reputation systems as discussed
+ - IPFS<>Filecoin CID indexing
+
+## Required resources
+
+#### Effort estimate
+
+
+TBD with team
+
+#### Roles / skills needed
+
+
+TBD with team