Commit b50dc07

author
Mike Goelzer
committed
Initial proposal
1 parent 17f6057 commit b50dc07

2 files changed

+168
-0
lines changed

proposals/images/bot-arch.png

149 KB
Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
1+
# Storage and Retrieval Dealbots
2+
3+
Authors: @mgoelzer
4+
5+
Initial PR: TBD <!-- Reference the PR first proposing this document. Oooh, self-reference! -->
6+
7+
<!--
8+
This template is for a proposal/brief/pitch for a significant project to be undertaken by a Web3 Dev project team.
9+
The goal of project proposals is to help us decide which work to take on, which things are more valuable than other things.
10+
-->
11+
<!--
12+
A proposal should contain enough detail for others to understand how this project contributes to our team’s mission of product-market fit
13+
for our unified stack of protocols, what is included in scope of the project, where to get started if a project team were to take this on,
14+
and any other information relevant for prioritizing this project against others.
15+
It does not need to describe the work in much detail. Most technical design and planning would take place after a proposal is adopted.
16+
Good project scope aims for ~3-5 engineers for 1-3 months (though feel free to suggest larger-scoped projects anyway).
17+
Projects do not include regular day-to-day maintenance and improvement work, e.g. on testing, tooling, validation, code clarity, refactors for future capability, etc.
18+
-->
19+
<!--
20+
For ease of discussion in PRs, consider breaking lines after every sentence or long phrase.
21+
-->
22+
23+
## Purpose & impact
24+
#### Background & intent
25+
In some cases, storage and retrieval deals on Filecoin mainnet fail. We do not currently have a good handle on how often this happens, what the causes are, whether it is specific to certain miners, whether miners refuse deals intentionally or because of software bugs, etc.
26+
27+
The Dealbots proposed here aim to address these problems by randomly selecting miners and making deals with them. For instance, the pair of bots can make a storage deal and later attempt to retrieve that same data to understand end-to-end reliability on mainnet.
28+
29+
The Retrieval Bot (r-bot) can also consume a list of {CID,miner} tuples and attempt retrieval on each one.
30+
31+
In all cases, we log the success or failure of each storage or retrieval attempt, along with diagnostic information such as where in the sequence the failure occurred, what the error message was and what the Lotus log tail contained.
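As a rough illustration of locating where in the sequence a failure occurred, the sketch below scans event lines shaped like those in the sample output of this proposal and records the first failure event. The function name `summarize_attempt`, the `FAILURE_EVENTS` set, and the rule that an attempt with no failure event counts as a success are all assumptions made for the example, not part of Lotus or of the proposed bots.

```python
import re

# Hypothetical set of events treated as failures. Both names appear in the
# sample output of this proposal; a real list would have to come from Lotus.
FAILURE_EVENTS = {"ClientEventProviderCancelled", "ClientEventDataTransferError"}

# Matches the event name in a line such as
# "Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)".
EVENT_RE = re.compile(r"(ClientEvent\w+)\s+\(")


def summarize_attempt(event_lines, error_message=""):
    """Build a result record and note the first failure event, if any."""
    failed_at = None
    for line in event_lines:
        match = EVENT_RE.search(line)
        if match and match.group(1) in FAILURE_EVENTS:
            failed_at = match.group(1)
            break
    return {
        # Simplification: no failure event seen is reported as success.
        "status": "failure" if failed_at else "success",
        "failedAt": failed_at,
        "eventList": list(event_lines),
        "errorMessage": error_message,
    }
```

A real bot would presumably also require an explicit completion event before reporting success.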
32+
33+
#### Assumptions & hypotheses
34+
_What must be true for this project to matter?_
35+
36+
- Some storage and retrieval deals fail on mainnet
37+
- This is happening for multiple reasons: code bugs that prevent storage or retrieval from running to successful completion, and miners intentionally refusing certain types of deals (certain sizes, or an asymmetry between servicing storage vs retrieval deals).
38+
- Understanding the different types of failure and their frequencies will help us find bugs in Lotus.
39+
- Understanding the same will help us understand if miner economic incentives are suboptimal.
40+
- Providing a tool that can aggregate data across many miners will provide a foundation for third parties to run miner reputation systems
41+
42+
#### User workflow example
43+
44+
```
$ ./dealbot --input path/to/deals/to/try.json
{
  "status":"failure",
  "failedAt":"ClientEventProviderCancelled", // failure event
  "eventList":
  [
    "Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)",
    "Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)",
    "Recv: 0 B, Paid 0 FIL, ClientEventDealAccepted (DealStatusAccepted)",
    "Recv: 0 B, Paid 0 FIL, ClientEventPaymentChannelAddingFunds (DealStatusPaymentChannelAllocatingLane)",
    "Recv: 0 B, Paid 0 FIL, ClientEventLaneAllocated (DealStatusOngoing)",
    "Recv: 0 B, Paid 0 FIL, ClientEventProviderCancelled (DealStatusCancelling)",
    "Recv: 0 B, Paid 0 FIL, ClientEventDataTransferError (DealStatusErrored)",
    "Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)"
  ],
  "errorMessage":"ERROR: retrieval failed: Retrieve: Retrieval Error: error generated by data transfer: unable to send cancel to channel FSM: normal shutdown of state machine",
  "tailLog":"....", // Multiline, from `tail` of `lotus daemon`
  "storageDealParameters": // Given to RetrievalBot as input
  {
    "CID":"Qm...",
    "sha256":"73cb385...", // independent checksum of data file
    "sizeInBytes":"12345678",
    "minerId":"f01924",
    "verified":true,
    "fastRetrievalFlag":true,
    "dealId":"..."
  },
  "lotusVersion":"1.5.3-rc2+mainnet+git.9afb5ff94",
  // Call API `Filecoin.Version`
  "datetime":"YYYY-MM-DD_HH:MM:SS" // when attempt started
},
{
  // ...next deal attempt json blob...
}
```
80+
81+
Stdout will contain the results, in JSON, of each deal attempt. It is intended to be piped into a log search service like those provided by AWS/GC.
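One way to make stdout easy for a log service to ingest is to write each result as a single JSON object per line. The sketch below assumes that line-delimited layout, which this proposal does not specify (the sample above is pretty-printed), and the helper name `emit_result` is hypothetical.

```python
import json
import sys


def emit_result(result, stream=sys.stdout):
    """Write one deal-attempt result as a single JSON line."""
    # sort_keys keeps the field order stable across runs.
    stream.write(json.dumps(result, sort_keys=True) + "\n")
    # Flush so a downstream log shipper sees each result promptly.
    stream.flush()
```

Each line can then be parsed independently, so a truncated or interleaved stream loses at most one record.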
82+
83+
84+
#### Impact
85+
_How would this directly contribute to web3 dev stack product-market fit?_
86+
87+
- Improve reliability of the network
88+
- Enable an ecosystem of miner reputation and ranking systems
89+
- Perform the retrieval verification in Slingshot 2.3
90+
91+
#### Leverage
92+
_How much would nailing this project improve our knowledge and ability to execute future projects?_
93+
94+
**Immensely!**
95+
96+
- We don't currently have enough information about why deals fail to allocate our debugging time and resources correctly.
97+
98+
- Miner reputation systems enabled by this tool would complement the protocol-level incentives for miners to "do the right thing" (provide reliable retrieval of previously stored data, successfully complete all storage deals, etc.)
99+
100+
#### Confidence
101+
_How sure are we that this impact would be realized? Label from [this scale](https://medium.com/@nimay/inside-product-introduction-to-feature-priority-using-ice-impact-confidence-ease-and-gist-5180434e5b15)_.
102+
103+
C = 8
104+
105+
Nothing is certain, but it is very likely that building this tool will at a minimum enable the Filecoin Project to better understand the frequency and causes of deal failures.
106+
107+
The tool's ability to support miner reputation systems should also help route more deals to reliable miners.
108+
109+
110+
## Project definition
111+
#### Brief plan of attack
112+
113+
<!--Briefly describe the milestones/steps/work needed for this project-->
114+
- **Phase 1: Retrieval Bot.** Reads stdin describing a CID to attempt to retrieve, writes outcome of retrieval attempt to stdout.
115+
- **Phase 2: Storage Bot.** Same idea but for storage deals.
116+
- **Phase 3: Orchestrator.** Invokes the r-bot and s-bot programs with inputs one-by-one from a long queue of CIDs to test retrieve, or files to test store, etc.
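The orchestrator in Phase 3 could be sketched roughly as follows: it starts one bot process per queued task, passes the task on stdin, and parses the JSON written to stdout. The function name `run_queue`, the task shape, the timeout, and the idea that each bot run prints exactly one JSON object are assumptions for illustration only.

```python
import json
import subprocess


def run_queue(bot_cmd, tasks, timeout_s=3600):
    """Run one bot process per task, feeding the task as JSON on stdin."""
    results = []
    for task in tasks:
        try:
            proc = subprocess.run(
                bot_cmd,
                input=json.dumps(task),
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            result = json.loads(proc.stdout)
        except (subprocess.TimeoutExpired, json.JSONDecodeError) as exc:
            # Record orchestration-level failures alongside deal failures.
            result = {"status": "failure", "errorMessage": str(exc)}
        results.append({"task": task, "result": result})
    return results
```

Running the tasks one at a time mirrors the one-by-one queue described above. A hypothetical invocation might be `run_queue(["./r-bot"], [{"CID": "Qm...", "minerId": "f01924"}])`, where the binary name is also an assumption.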
117+
118+
#### What does done look like?
119+
_What specific deliverables should be completed to consider this project done?_
120+
121+
![High level architecture](/protocol/web3-dev-team/blob/bots-proposal/proposals/images/bot-arch.png)
122+
123+
#### What does success look like?
124+
_Success means impact. How will we know we did the right thing?_
125+
126+
- We have a metrics dashboard (in Observable, Grafana, etc) that continuously shows the most recent deal failures, how frequently they are happening, which miners fail most/least, and similar metrics. The impact of this should be obvious: a clearer understanding of why and how often deals are failing on mainnet.
127+
- Reputation systems emerge from ecosystem partners that use the data generated by running these bots to rank miners. This would give Filecoin users reliable, real-time miner ranking, which does not currently exist in the ecosystem.
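To make the dashboard and ranking ideas concrete, here is a minimal sketch that computes a per-miner failure rate from a list of result records and orders miners by it. The record fields (`status`, and a miner ID under `storageDealParameters`) are modeled on the workflow example, and the function names and the ranking rule are assumptions. A real reputation system would likely weigh many more signals.

```python
from collections import defaultdict


def failure_rates(records):
    """Return {miner_id: fraction of attempts that failed}."""
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        # Assumed field name for the miner ID in each result record.
        miner = record["storageDealParameters"]["minerId"]
        attempts[miner] += 1
        if record["status"] == "failure":
            failures[miner] += 1
    return {miner: failures[miner] / attempts[miner] for miner in attempts}


def rank_miners(records):
    """Sort miner IDs from lowest to highest failure rate."""
    rates = failure_rates(records)
    # Ties are broken by miner ID so the ordering is deterministic.
    return sorted(rates, key=lambda miner: (rates[miner], miner))
```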
128+
129+
#### Counterpoints & pre-mortem
130+
_Why might this project be lower impact than expected? How could this project fail to complete, or fail to be successful?_
131+
132+
- The metrics fail to give us actionable debugging ideas
133+
- Reputation systems develop their own code to capture the same miner statistics (duplication of effort)
134+
135+
#### Alternatives
136+
_How might this project’s intent be realized in other ways (other than this project proposal)? What other potential solutions can address the same need?_
137+
138+
- [@whyrusleeping](https://github.com/whyrusleeping/)'s [Estuary](https://github.com/whyrusleeping/estuary) tool
139+
140+
#### Dependencies/prerequisites
141+
<!--List any other projects that are dependencies/prerequisites for this project that is being pitched.-->
142+
143+
- [filecoin-project/lotus/pull/5833/](https://github.com/filecoin-project/lotus/pull/5833/)
144+
145+
146+
#### Future opportunities
147+
<!--What future projects/opportunities could this project enable?-->
148+
149+
- Miner reputation systems as discussed
150+
151+
## Required resources
152+
153+
#### Effort estimate
154+
<!--T-shirt size rating of the size of the project. If the project might require external collaborators/teams, please note in the roles/skills section below).
155+
For a team of 3-5 people with the appropriate skills:
156+
- Small, 1-2 weeks
157+
- Medium, 3-5 weeks
158+
- Large, 6-10 weeks
159+
- XLarge, >10 weeks
160+
Describe any choices and uncertainty in this scope estimate. (E.g. Uncertainty in the scope until design work is complete, low uncertainty in execution thereafter.)
161+
-->
162+
163+
TBD with team
164+
165+
#### Roles / skills needed
166+
<!--Describe the knowledge/skill-sets and team that are needed for this project (e.g. PM, docs, protocol or library expertise, design expertise, etc.). If this project could be externalized to the community or a team outside PL's direct employment, please note that here.-->
167+
168+
TBD with team
