Dealbot #84

Merged
merged 4 commits into from
Jun 9, 2021
Binary file added proposals/images/bot-arch.png
168 changes: 168 additions & 0 deletions proposals/storage-and-retrieval-bots.md
@@ -0,0 +1,168 @@
# Storage and Retrieval Dealbots

Authors: @mgoelzer

Initial PR: #84

<!--
This template is for a proposal/brief/pitch for a significant project to be undertaken by a Web3 Dev project team.
The goal of project proposals is to help us decide which work to take on, which things are more valuable than other things.
-->
<!--
A proposal should contain enough detail for others to understand how this project contributes to our team’s mission of product-market fit
for our unified stack of protocols, what is included in scope of the project, where to get started if a project team were to take this on,
and any other information relevant for prioritizing this project against others.
It does not need to describe the work in much detail. Most technical design and planning would take place after a proposal is adopted.
Good project scope aims for ~3-5 engineers for 1-3 months (though feel free to suggest larger-scoped projects anyway).
Projects do not include regular day-to-day maintenance and improvement work, e.g. on testing, tooling, validation, code clarity, refactors for future capability, etc.
-->
<!--
For ease of discussion in PRs, consider breaking lines after every sentence or long phrase.
-->

## Purpose & impact
#### Background & intent
In some cases, storage and retrieval deals on Filecoin mainnet fail. We do not currently have a good handle on how often this happens, what the causes are, whether it is specific to certain miners, or whether miners refuse deals intentionally or because of software bugs.

The Dealbots proposed here aim to address these problems by randomly selecting miners and making deals with them. For instance, the pair of bots can make a storage deal and later attempt to retrieve that same data to measure end-to-end reliability on mainnet.

The Retrieval Bot (r-bot) can also consume a list of {CID,miner} tuples and attempt retrieval on each one.

In all cases, we log the success or failure of each storage or retrieval attempt, along with diagnostic information such as where in the sequence the failure occurred, what the error message was and what the Lotus log tail contained.
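
As a rough illustration of the behavior described above, the sketch below walks a list of (CID, miner) pairs and records one outcome per attempt, whether it succeeded or failed. Everything named here (`AttemptResult`, `RetrievalFailed`, `attempt_all`, and the caller-supplied `retrieve` function) is hypothetical and not part of Lotus or the proposed bots; a real implementation would also capture the Lotus log tail.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


class RetrievalFailed(Exception):
    """Hypothetical error raised by a retrieval client when an attempt fails."""

    def __init__(self, failed_at: str, message: str, events: List[str]):
        super().__init__(message)
        self.failed_at = failed_at  # event at which the sequence broke
        self.events = events        # events observed up to the failure


@dataclass
class AttemptResult:
    cid: str
    miner: str
    status: str                          # "success" or "failure"
    failed_at: Optional[str] = None      # where in the sequence it failed
    error_message: Optional[str] = None
    event_list: List[str] = field(default_factory=list)


def attempt_all(
    tasks: Iterable[Tuple[str, str]],
    retrieve: Callable[[str, str], List[str]],
) -> List[AttemptResult]:
    """Try each (CID, miner) pair once and record the outcome either way."""
    results = []
    for cid, miner in tasks:
        try:
            # `retrieve` is assumed to return the list of observed events.
            events = retrieve(cid, miner)
            results.append(AttemptResult(cid, miner, "success", event_list=events))
        except RetrievalFailed as err:
            results.append(
                AttemptResult(cid, miner, "failure", err.failed_at, str(err), err.events)
            )
    return results
```

Keeping the retrieval call behind a plain callable means the same loop could be exercised with a stub in tests and with a real client in production.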

#### Assumptions & hypotheses
_What must be true for this project to matter?_

- Some storage and retrieval deals fail on mainnet
- This is happening for multiple reasons: code bugs that prevent storage or retrieval from running to successful completion, miners intentionally refusing certain types of deals (certain sizes, or an asymmetry between servicing storage vs retrieval deals).
- Understanding the different types of failure and their frequencies will help us find bugs in Lotus.
- Understanding the same will help us understand if miner economic incentives are suboptimal.
- Providing a tool that can aggregate data across many miners will provide a foundation for third parties to run miner reputation systems.

#### User workflow example

```
$ ./dealbot --input path/to/deals/to/try.json
{
  "status":"failure",
  "failedAt":"ClientEventProviderCancelled", // failure event
  "eventList":
  [
    "Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)",
    "Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)",
    "Recv: 0 B, Paid 0 FIL, ClientEventDealAccepted (DealStatusAccepted)",
    "Recv: 0 B, Paid 0 FIL, ClientEventPaymentChannelAddingFunds (DealStatusPaymentChannelAllocatingLane)",
    "Recv: 0 B, Paid 0 FIL, ClientEventLaneAllocated (DealStatusOngoing)",
    "Recv: 0 B, Paid 0 FIL, ClientEventProviderCancelled (DealStatusCancelling)",
    "Recv: 0 B, Paid 0 FIL, ClientEventDataTransferError (DealStatusErrored)",
    "Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)"
  ],
  "errorMessage":"ERROR: retrieval failed: Retrieve: Retrieval Error: error generated by data transfer: unable to send cancel to channel FSM: normal shutdown of state machine",
  "tailLog":"....", // Multiline, from `tail` of `lotus daemon`
  "storageDealParameters": // Given to RetrievalBot as input
  {
    "CID":"Qm...",
    "sha256":"73cb385...", // independent checksum of data file
    "sizeInBytes":"12345678",
    "minerId":"f01924",
    "verified":true,
    "fastRetrievalFlag":true,
    "dealId":"..."
  },
  "lotusVersion":"1.5.3-rc2+mainnet+git.9afb5ff94",
  // Call API `Filecoin.Version`
  "datetime":"YYYY-MM-DD_HH:MM:SS" // when attempt started
},
{
  // ...next deal attempt json blob...
}
```

Stdout will contain the result of each deal attempt as JSON. It is intended to be piped into a log search service such as those provided by AWS/GC.
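
One way to make such output easy for a log service to ingest is to write each attempt as a single line of JSON. The sketch below assumes that line-per-record layout, which the proposal does not specify; `emit` and `read_records` are hypothetical helper names.

```python
import json
import sys
from typing import Dict, Iterator, Optional, TextIO


def emit(record: Dict, stream: Optional[TextIO] = None) -> None:
    """Write one deal-attempt record as a single line of JSON."""
    out = stream if stream is not None else sys.stdout
    # json.dumps escapes embedded newlines, so a multiline log tail
    # still occupies exactly one output line.
    out.write(json.dumps(record, sort_keys=True) + "\n")
    out.flush()


def read_records(stream: TextIO) -> Iterator[Dict]:
    """Parse records back, one per non-empty line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)
```

Flushing after every record keeps a downstream consumer up to date even when a deal attempt runs for a long time.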


#### Impact
_How would this directly contribute to web3 dev stack product-market fit?_

- Improve reliability of the network
- Enable an ecosystem of miner reputation and ranking systems
- Perform the retrieval verification in Slingshot 2.3

#### Leverage
_How much would nailing this project improve our knowledge and ability to execute future projects?_

**Immensely!**

- We don't currently have enough information about why deals fail to allocate our debugging time and resources correctly.

- Miner reputation systems enabled by this tool would complement the protocol-level incentives for miners to "do the right thing" (provide reliable retrieval of previously stored data, successfully complete all storage deals, etc.).

#### Confidence
_How sure are we that this impact would be realized? Label from [this scale](https://medium.com/@nimay/inside-product-introduction-to-feature-priority-using-ice-impact-confidence-ease-and-gist-5180434e5b15)_.

C = 8

Nothing is certain, but it is very likely that building this tool will at a minimum enable the Filecoin Project to better understand the frequency and causes of deal failures.

And the ability of this tool to support miner reputation systems can only help increase deals that get routed to reliable miners.


## Project definition
#### Brief plan of attack
Contributor

ℹ️ Noting that this project doesn't list

  • Unit tests
  • Monitoring of ongoing functionality of the bots
  • Documentation for usage by external entities

With these, we should expect this project to be at least a 'medium' or to take at least 4 weeks of time.

Author

I'm 👍 on tests and docs.

For ongoing monitoring, the dashboards project might inadvertently solve for that. If the bots start failing, it will probably be immediately apparent to viewers of the dashboards (e.g., all metrics suddenly go to zero).


<!--Briefly describe the milestones/steps/work needed for this project-->
- **Phase 1: Retrieval Bot.** Reads stdin describing a CID to attempt to retrieve, writes outcome of retrieval attempt to stdout.
- **Phase 2: Storage Bot.** Same idea but for storage deals.
- **Phase 3: Orchestrator.** Invokes the r-bot and s-bot programs with inputs one-by-one from a long queue of CIDs to test retrieve, or files to test store, etc.
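
The orchestrator loop in Phase 3 could look roughly like the sketch below: it runs a bot program once per queued task, passes the task on stdin, and parses the bot's stdout. This assumes each bot reads one JSON task and prints one JSON result, which is an illustrative contract rather than a settled interface; `run_queue`, the `bot_error` status, and the timeout default are all hypothetical.

```python
import json
import subprocess
from typing import Dict, Iterable, List, Sequence


def run_queue(
    bot_cmd: Sequence[str],
    tasks: Iterable[Dict],
    timeout_s: float = 3600,
) -> List[Dict]:
    """Invoke the bot once per task, one at a time, and collect outcomes."""
    outcomes = []
    for task in tasks:
        try:
            proc = subprocess.run(
                list(bot_cmd),
                input=json.dumps(task),  # task description goes to stdin
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
            outcome = json.loads(proc.stdout)
        except (subprocess.TimeoutExpired, json.JSONDecodeError) as err:
            # The bot itself misbehaved; keep that distinct from a deal failure.
            outcome = {"status": "bot_error", "errorMessage": str(err)}
        outcomes.append({"task": task, "outcome": outcome})
    return outcomes
```

Recording bot-level errors separately from deal failures keeps a crashed or hung bot from being counted against a miner.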

#### What does done look like?
Contributor

We do expect to be running this code for some time, and that comes with both an infra and a code burden.

Are there other criteria or thoughts that the stewards have for thinking about what they'd like to see before the project team moves on? cc @BigLep

_What specific deliverables should be completed to consider this project done?_

![High level architecture](https://github.com/protocol/web3-dev-team/blob/bots-proposal/proposals/images/bot-arch.png)

#### What does success look like?
_Success means impact. How will we know we did the right thing?_

- We have a metrics dashboard (in Observable, Grafana, etc) that continuously shows the most recent deal failures, how frequently they are happening, which miners fail most/least, and similar metrics. The impact of this should be obvious: a clearer understanding of why and how often deals are failing on mainnet.
- Reputation systems emerge from ecosystem partners that use the data generated by running these bots to rank miners. This would give Filecoin users reliable, real-time miner ranking, which does not currently exist in the ecosystem.
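
A dashboard or reputation system built on this data would need some per-miner aggregation. The sketch below shows one simple possibility, a failure rate per miner sorted worst first. It assumes each record has a `status` field and a miner ID at `storageDealParameters.minerId`; the function name and the choice of metric are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def failure_rates(records: Iterable[Dict]) -> List[Tuple[str, int, int, float]]:
    """Return (miner, failures, attempts, failure_rate) rows, worst miner first."""
    attempts: Counter = Counter()
    failures: Counter = Counter()
    for rec in records:
        miner = rec["storageDealParameters"]["minerId"]  # assumed field name
        attempts[miner] += 1
        if rec["status"] == "failure":
            failures[miner] += 1
    rows = [(m, failures[m], n, failures[m] / n) for m, n in attempts.items()]
    # Sort by failure rate descending, then by miner ID for a stable order.
    return sorted(rows, key=lambda r: (-r[3], r[0]))
```

A raw rate like this is noisy for miners with only a few attempts, so a real ranking would likely want a minimum sample size or some smoothing.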

#### Counterpoints & pre-mortem
_Why might this project be lower impact than expected? How could this project fail to complete, or fail to be successful?_

- The metrics fail to give us actionable debugging ideas
- Reputation systems develop their own code to capture the same miner statistics (duplication of effort)

#### Alternatives
_How might this project’s intent be realized in other ways (other than this project proposal)? What other potential solutions can address the same need?_

- [@whyrusleeping](https://github.com/whyrusleeping/)'s [Estuary](https://github.com/whyrusleeping/estuary) tool

#### Dependencies/prerequisites
<!--List any other projects that are dependencies/prerequisites for this project that is being pitched.-->

- [filecoin-project/lotus/pull/5833/](https://github.com/filecoin-project/lotus/pull/5833/)

#### Future opportunities
<!--What future projects/opportunities could this project enable?-->

- Miner reputation systems as discussed

## Required resources

#### Effort estimate
<!--T-shirt size rating of the size of the project. If the project might require external collaborators/teams, please note in the roles/skills section below).
For a team of 3-5 people with the appropriate skills:
- Small, 1-2 weeks
- Medium, 3-5 weeks
- Large, 6-10 weeks
- XLarge, >10 weeks
Describe any choices and uncertainty in this scope estimate. (E.g. Uncertainty in the scope until design work is complete, low uncertainty in execution thereafter.)
-->

TBD with team

#### Roles / skills needed
<!--Describe the knowledge/skill-sets and team that are needed for this project (e.g. PM, docs, protocol or library expertise, design expertise, etc.). If this project could be externalized to the community or a team outside PL's direct employment, please note that here.-->

TBD with team