Skip to content

Commit 1328dad

Browse files
committed
service documentation
1 parent d9d0569 commit 1328dad

File tree

3 files changed

+207
-58
lines changed

3 files changed

+207
-58
lines changed

statediff/builder_test.go

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,6 @@ import (
3232
sdtypes "github.com/ethereum/go-ethereum/statediff/types"
3333
)
3434

35-
// TODO: add test that filters on address
3635
var (
3736
contractLeafKey []byte
3837
emptyDiffs = make([]sdtypes.StateNode, 0)

statediff/doc.go

Lines changed: 0 additions & 57 deletions
This file was deleted.

statediff/doc.md

Lines changed: 207 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,207 @@
1+
This package provides an auxiliary service that asynchronously processes state diff objects from chain events,
2+
either relaying the state objects to rpc subscribers or writing them directly to Postgres.
3+
4+
It also exposes RPC endpoints for fetching or writing to Postgres the state diff `StateObject` at a specific block height
5+
or for a specific block hash, this operates on historic block and state data and so is dependent on having a complete state archive.
6+
7+
# Statediff Object
8+
A state diff `StateObject` is the collection of all the state and storage trie nodes that have been updated in a given block.
9+
For convenience, we also associate these nodes with the block number and hash, and optionally the set of code hashes and code for any
10+
contracts deployed in this block.
11+
12+
A complete state diff `StateObject` will include all state and storage intermediate nodes, which is necessary for generating proofs and for
13+
traversing the tries.
14+
15+
```go
16+
// StateObject is a collection of state (and linked storage nodes) as well as the associated block number, block hash,
17+
// and a set of code hashes and their code
18+
type StateObject struct {
19+
BlockNumber *big.Int `json:"blockNumber" gencodec:"required"`
20+
BlockHash common.Hash `json:"blockHash" gencodec:"required"`
21+
Nodes []StateNode `json:"nodes" gencodec:"required"`
22+
CodeAndCodeHashes []CodeAndCodeHash `json:"codeMapping"`
23+
}
24+
25+
// StateNode holds the data for a single state diff node
26+
type StateNode struct {
27+
NodeType NodeType `json:"nodeType" gencodec:"required"`
28+
Path []byte `json:"path" gencodec:"required"`
29+
NodeValue []byte `json:"value" gencodec:"required"`
30+
StorageNodes []StorageNode `json:"storage"`
31+
LeafKey []byte `json:"leafKey"`
32+
}
33+
34+
// StorageNode holds the data for a single storage diff node
35+
type StorageNode struct {
36+
NodeType NodeType `json:"nodeType" gencodec:"required"`
37+
Path []byte `json:"path" gencodec:"required"`
38+
NodeValue []byte `json:"value" gencodec:"required"`
39+
LeafKey []byte `json:"leafKey"`
40+
}
41+
42+
// CodeAndCodeHash struct for holding codehash => code mappings
43+
// we can't use an actual map because they are not rlp serializable
44+
type CodeAndCodeHash struct {
45+
Hash common.Hash `json:"codeHash"`
46+
Code []byte `json:"code"`
47+
}
48+
```
49+
These objects are packed into a `Payload` structure which additionally associates the StateObject
50+
with the block (header, uncles, and transactions), receipts, and total difficulty.
51+
This `Payload` encapsulates all the block and state data at a given block, and allows us to index the entire Ethereum data structure
52+
as hash-linked IPLD objects.
53+
54+
```go
55+
// Payload packages the data to send to state diff subscriptions
56+
type Payload struct {
57+
BlockRlp []byte `json:"blockRlp"`
58+
TotalDifficulty *big.Int `json:"totalDifficulty"`
59+
ReceiptsRlp []byte `json:"receiptsRlp"`
60+
StateObjectRlp []byte `json:"stateObjectRlp" gencodec:"required"`
61+
62+
encoded []byte
63+
err error
64+
}
65+
```
66+
67+
# Usage
68+
This state diffing service runs as an auxiliary service concurrent to the regular syncing process of the geth node.
69+
70+
71+
## CLI configuration
72+
This service introduces a CLI flag namespace `statediff`
73+
74+
`--statediff` flag is used to turn on the service
75+
`--statediff.writing` is used to tell the service to write state diff objects it produces from synced ChainEvents directly to a configured Postgres database
76+
`--statediff.db` is the connection string for the Postgres database to write to
77+
`--statediff.dbnodeid` is the node id to use in the Postgres database
78+
`--statediff.dbclientname` is the client name to use in the Postgres database
79+
80+
The service can only operate in full sync mode (`--syncmode=full`), but only the historical RPC endpoints require an archive node (`--gcmode=archive`)
81+
82+
e.g.
83+
`
84+
./build/bin/geth --syncmode=full --gcmode=archive --statediff --statediff.writing --statediff.db=postgres://localhost:5432/vulcanize_testing?sslmode=disable --statediff.dbnodeid={nodeId} --statediff.dbclientname={dbClientName}
85+
`
86+
87+
## RPC endpoints
88+
The state diffing service exposes both a WS subscription endpoint, and a number of HTTP unary endpoints.
89+
90+
Each of these endpoints requires a set of parameters provided by the caller
91+
92+
```go
93+
// Params is used to carry in parameters from subscribing/requesting clients configuration
94+
type Params struct {
95+
IntermediateStateNodes bool
96+
IntermediateStorageNodes bool
97+
IncludeBlock bool
98+
IncludeReceipts bool
99+
IncludeTD bool
100+
IncludeCode bool
101+
WatchedAddresses []common.Address
102+
WatchedStorageSlots []common.Hash
103+
}
104+
```
105+
106+
Using these params we can tell the service whether to include state and/or storage intermediate nodes; whether
107+
to include the associated block (header, uncles, and transactions); whether to include the associated receipts;
108+
whether to include the total difficult for this block; whether to include the set of code hashes and code for
109+
contracts deployed in this block; whether to limit the diffing process to a list of specific addresses; and/or
110+
whether to limit the diffing process to a list of specific storage slot keys.
111+
112+
### Subscription endpoint
113+
A websocket supporting RPC endpoint is exposed for subscribing to state diff `StateObjects` that come off the head of the chain while the geth node syncs.
114+
115+
```go
116+
// Stream is a subscription endpoint that fires off state diff payloads as they are created
117+
Stream(ctx context.Context, params Params) (*rpc.Subscription, error)
118+
```
119+
120+
To expose this endpoint the node needs to have the websocket server turned on (`--ws`),
121+
and the `statediff` namespace exposed (`--ws.api=statediff`).
122+
123+
Go code subscriptions to this endpoint can be created using the `rpc.Client.Subscribe()` method,
124+
with the "statediff" namespace, a `statediff.Payload` channel, and the name of the statediff api's rpc method: "stream".
125+
126+
e.g.
127+
128+
```go
129+
130+
cli, err := rpc.Dial("ipcPathOrWsURL")
131+
if err != nil {
132+
// handle error
133+
}
134+
stateDiffPayloadChan := make(chan statediff.Payload, 20000)
135+
methodName := "stream"
136+
params := statediff.Params{
137+
IncludeBlock: true,
138+
IncludeTD: true,
139+
IncludeReceipts: true,
140+
IntermediateStorageNodes: true,
141+
IntermediateStateNodes: true,
142+
}
143+
rpcSub, err := cli.Subscribe(context.Background(), statediff.APIName, stateDiffPayloadChan, methodName, params)
144+
if err != nil {
145+
// handle error
146+
}
147+
for {
148+
select {
149+
case stateDiffPayload := <- stateDiffPayloadChan:
150+
// process the payload
151+
case err := <- rpcSub.Err():
152+
// handle rpc subscription error
153+
}
154+
}
155+
```
156+
157+
### Unary endpoints
158+
The service also exposes unary RPC endpoints for retrieving the state diff `StateObject` for a specific block height/hash.
159+
```go
160+
// StateDiffAt returns a state diff payload at the specific blockheight
161+
StateDiffAt(ctx context.Context, blockNumber uint64, params Params) (*Payload, error)
162+
163+
// StateDiffFor returns a state diff payload for the specific blockhash
164+
StateDiffFor(ctx context.Context, blockHash common.Hash, params Params) (*Payload, error)
165+
```
166+
167+
To expose this endpoint the node needs to have the HTTP server turned on (`--http`),
168+
and the `statediff` namespace exposed (`--http.api=statediff`).
169+
170+
## Direct indexing into Postgres
171+
If `--statediff.writing` is set, the service will convert the state diff `StateObject` data into IPLD objects, persist them directly to Postgres,
172+
and generate secondary indexes around the IPLD data.
173+
174+
The schema and migrations for this Postgres database are provided in `statediff/db/`.
175+
176+
### Postgres setup
177+
We use [pressly/goose](https://github.com/pressly/goose) as our Postgres migration manager.
178+
You can also load the Postgres schema directly into a database using
179+
180+
`psql database_name < schema.sql`
181+
182+
This will only work on a version 12.4 Postgres database.
183+
184+
### Schema overview
185+
Our Postgres schemas are built around a single IPFS backing Postgres IPLD blockstore table (`public.blocks`) that conforms with [go-ds-sql](https://github.com/ipfs/go-ds-sql/blob/master/postgres/postgres.go).
186+
All IPLD objects are stored in this table, where `key` is blockstore-prefixed multihash key for the IPLD object and `data` contains
187+
the bytes for the IPLD block (in the case of all Ethereum IPLDs, this is the RLP byte encoding of the Ethereum object).
188+
189+
The IPLD objects in this table can be traversed using an IPLD DAG interface, but since this table only maps multihash to raw IPLD object
190+
it is not particularly useful for searching through the data or and does not allow us to look up Ethereum objects by their constituent fields
191+
(e.g. by block number, tx source/recipient, state/storage trie node path). To improve the accessibility of these Ethereum IPLD objects
192+
we generate secondary indexes on top of the raw IPLDs in other Postgres tables. This collection of tables encapsulates an Ethereum [advanced data layout](https://github.com/ipld/specs#schemas-and-advanced-data-layouts) (ADL).
193+
194+
These secondary index tables fall under the `eth` schema and follow an `{objectType}_cids` naming convention.
195+
Each of these tables provides a view into the individual fields of the underlying Ethereum IPLD object and references the raw IPLD object stored in `public.blocks` by multihash foreign key.
196+
Additionally, these tables link up to their parent object tables. E.g. the `storage_cids` table contains a `state_id` foreign key which references the `id`
197+
for the `state_cids` entry that contains the state leaf node for the contract the storage node belongs to, and in turn that `state_cids` entry contains a `header_id`
198+
foreign key which references the `id` of the `header_cids` entry that contains the header for the block these state and storage nodes were updated (diffed).
199+
200+
## Optimization
201+
On mainnet this process is extremely IO intensive and requires significant resources to allow it to keep up with the head of the chain.
202+
The state diff processing time for a specific block is dependent on the number and complexity of the state changes that occur in a block and
203+
the number of updated state nodes that are available in the in-memory cache vs must be retrieved from disc.
204+
205+
If memory permits, one means of improving the efficiency of this process is to increase the trie cache allocation.
206+
This can be done by increasing the overall `--cache` allocation and/or by increasing the % of the cache allocated to trie
207+
usage with `--cache.trie`.

0 commit comments

Comments
 (0)