Commit 8a2d69b

docs: Add comprehensive guide for dataset inspection
- Full guide covering both Registry and Admin clients
- Use cases: interactive exploration, finding specific columns, type inspection, checking nullability, building dynamic queries
- Practical examples for finding Ethereum addresses and hashes
- Complete API reference with expected output
1 parent a508569 commit 8a2d69b

1 file changed: +211 -0 lines changed

docs/inspecting_datasets.md

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
# Inspecting Dataset Schemas

The Registry API provides convenient methods to explore dataset structures without manually parsing manifests.

## Quick Start

```python
from amp.registry import RegistryClient

client = RegistryClient()

# Pretty-print dataset structure
client.datasets.inspect('graphops', 'ethereum-mainnet')

# Get structured schema data
schema = client.datasets.describe('graphops', 'ethereum-mainnet')
```

## Methods

### `inspect(namespace, name, version='latest')`

Pretty-prints the dataset structure in a human-readable format. Perfect for interactive exploration.

**Example Output:**
```
Dataset: graphops/ethereum-mainnet@latest
Description: Ethereum mainnet blockchain data

📊 blocks (21 columns)
  • block_num    UInt64                 NOT NULL
  • timestamp    Timestamp(Nanosecond)  NOT NULL
  • hash         FixedSizeBinary(32)    NOT NULL
  • parent_hash  FixedSizeBinary(32)    NOT NULL
  • miner        FixedSizeBinary(20)    NOT NULL
  ...

📊 transactions (24 columns)
  • block_num    UInt64                 NOT NULL
  • tx_hash      FixedSizeBinary(32)    NOT NULL
  • from         FixedSizeBinary(20)    NOT NULL
  • to           FixedSizeBinary(20)    NULL
  ...
```

### `describe(namespace, name, version='latest')`

Returns a structured dictionary mapping table names to column information. Use this for programmatic access.

**Returns:**
```python
{
    'blocks': [
        {'name': 'block_num', 'type': 'UInt64', 'nullable': False},
        {'name': 'timestamp', 'type': 'Timestamp(Nanosecond)', 'nullable': False},
        {'name': 'hash', 'type': 'FixedSizeBinary(32)', 'nullable': False},
        ...
    ],
    'transactions': [
        {'name': 'tx_hash', 'type': 'FixedSizeBinary(32)', 'nullable': False},
        ...
    ]
}
```

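Because `describe()` returns a list of column dictionaries per table, looking up a single column by name means scanning that list. The following is a minimal sketch of one way to index the result, assuming the return shape documented above. `index_schema` and `sample_schema` are hypothetical names introduced for illustration, and the sample data stands in for a live `describe()` call.

```python
# Sample data shaped like the documented describe() return value.
# In real use this would come from client.datasets.describe(...).
sample_schema = {
    'blocks': [
        {'name': 'block_num', 'type': 'UInt64', 'nullable': False},
        {'name': 'hash', 'type': 'FixedSizeBinary(32)', 'nullable': False},
    ],
    'transactions': [
        {'name': 'tx_hash', 'type': 'FixedSizeBinary(32)', 'nullable': False},
        {'name': 'to', 'type': 'FixedSizeBinary(20)', 'nullable': True},
    ],
}


def index_schema(schema):
    """Hypothetical helper: map table name -> column name -> column info."""
    return {
        table: {col['name']: col for col in columns}
        for table, columns in schema.items()
    }


columns_by_name = index_schema(sample_schema)
print(columns_by_name['transactions']['to']['nullable'])  # True
print('miner' in columns_by_name['blocks'])               # False (not in the sample)
```

With the index in place, a column lookup is a dictionary access rather than a list scan, which is convenient when a script checks many columns.
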
## Use Cases

### 1. Interactive Exploration

```python
# Quickly see what's available
client.datasets.inspect('namespace', 'dataset-name')
```

### 2. Finding Specific Columns

```python
schema = client.datasets.describe('namespace', 'dataset-name')

# Find tables with specific columns
for table_name, columns in schema.items():
    col_names = [col['name'] for col in columns]
    if 'address' in col_names:
        print(f"Table '{table_name}' has an address column")
```

### 3. Finding Ethereum Addresses

```python
schema = client.datasets.describe('namespace', 'dataset-name')

# Find all address columns (20-byte binary fields)
for table_name, columns in schema.items():
    address_cols = [col['name'] for col in columns if col['type'] == 'FixedSizeBinary(20)']
    if address_cols:
        print(f"{table_name}: {', '.join(address_cols)}")

# Example output:
# blocks: miner
# transactions: from, to
# logs: address
```

### 4. Finding Transaction/Block Hashes

```python
schema = client.datasets.describe('namespace', 'dataset-name')

# Find all hash columns (32-byte binary fields)
for table_name, columns in schema.items():
    hash_cols = [col['name'] for col in columns if col['type'] == 'FixedSizeBinary(32)']
    if hash_cols:
        print(f"{table_name}: {', '.join(hash_cols)}")

# Example output:
# blocks: hash, parent_hash, state_root, transactions_root
# transactions: block_hash, tx_hash
# logs: block_hash, tx_hash, topic0, topic1, topic2, topic3
```

### 5. Checking Nullable Columns

```python
schema = client.datasets.describe('namespace', 'dataset-name')

# Find columns that allow NULL values (important for data quality)
for table_name, columns in schema.items():
    nullable_cols = [col['name'] for col in columns if col['nullable']]
    print(f"{table_name}: {len(nullable_cols)}/{len(columns)} nullable columns")
    print(f"  Nullable: {', '.join(nullable_cols[:5])}")

# Example output:
# transactions: 5/24 nullable columns
#   Nullable: to, gas_price, value, max_fee_per_gas, max_priority_fee_per_gas
```

### 6. Building Dynamic Queries

```python
from amp import Client
from amp.registry import RegistryClient

registry_client = RegistryClient()
client = Client(
    query_url='grpc://localhost:1602',
    admin_url='http://localhost:8080',
    auth=True
)

# Discover available tables
schema = registry_client.datasets.describe('namespace', 'dataset-name')
print(f"Available tables: {list(schema.keys())}")

# Build query based on available columns
if 'blocks' in schema:
    block_cols = [col['name'] for col in schema['blocks']]
    if 'block_num' in block_cols and 'timestamp' in block_cols:
        # Safe to query these columns
        result = client.sql("SELECT block_num, timestamp FROM blocks LIMIT 10")
```

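The check-then-query pattern above can be factored into a small helper. This is a sketch only: `build_select` and `quote_ident` are hypothetical names, not part of the `amp` package, and the double-quoted identifiers assume the query engine accepts standard SQL identifier quoting. Quoting matters here because column names such as `from` and `to` collide with SQL keywords.

```python
def quote_ident(name):
    """Hypothetical helper: double-quote an identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(schema, table, wanted, limit=10):
    """Hypothetical helper: build a SELECT using only columns present in the schema."""
    if table not in schema:
        raise KeyError(f"unknown table: {table}")
    available = {col['name'] for col in schema[table]}
    missing = [name for name in wanted if name not in available]
    if missing:
        raise ValueError(f"{table} has no column(s): {', '.join(missing)}")
    columns = ', '.join(quote_ident(name) for name in wanted)
    return f"SELECT {columns} FROM {quote_ident(table)} LIMIT {int(limit)}"


# Sample data shaped like a describe() result; stands in for a live call.
sample_schema = {
    'transactions': [
        {'name': 'tx_hash', 'type': 'FixedSizeBinary(32)', 'nullable': False},
        {'name': 'from', 'type': 'FixedSizeBinary(20)', 'nullable': False},
        {'name': 'to', 'type': 'FixedSizeBinary(20)', 'nullable': True},
    ],
}

query = build_select(sample_schema, 'transactions', ['tx_hash', 'from', 'to'], limit=5)
print(query)
# SELECT "tx_hash", "from", "to" FROM "transactions" LIMIT 5
```

The resulting string could then be passed to `client.sql(...)` as in the example above. Failing early with a `ValueError` keeps a missing column from surfacing later as a server-side query error.
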
## Supported Arrow Types

The `describe()` and `inspect()` methods handle these Arrow types:

- **Primitives**: `UInt64`, `Int32`, `Boolean`, `Binary`
- **Timestamps**: `Timestamp(Nanosecond)`, `Timestamp(Microsecond)`, etc.
- **Fixed-size Binary**: `FixedSizeBinary(20)` (addresses), `FixedSizeBinary(32)` (hashes)
- **Decimals**: `Decimal128(38,0)` (large integers), `Decimal128(18,6)` (fixed-point)

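Since the `type` field is a plain string, scripts that need the byte width or the decimal scale have to split it apart. The sketch below parses the flat forms listed above into a base name and a tuple of parameters. `parse_arrow_type` is a hypothetical helper, and it deliberately rejects nested forms (for example, a timestamp type string that carries a timezone argument), which it does not attempt to handle.

```python
import re

# Matches "Name" or "Name(arg1, arg2)" with no nested parentheses.
_TYPE_PATTERN = re.compile(r'^(?P<base>\w+)(?:\((?P<params>[^()]*)\))?$')


def parse_arrow_type(type_str):
    """Hypothetical helper: split a type string into (base, params)."""
    match = _TYPE_PATTERN.match(type_str.strip())
    if match is None:
        raise ValueError(f"unsupported type string: {type_str!r}")
    params = match.group('params')
    if not params:
        return match.group('base'), ()
    return match.group('base'), tuple(p.strip() for p in params.split(','))


print(parse_arrow_type('UInt64'))                 # ('UInt64', ())
print(parse_arrow_type('FixedSizeBinary(20)'))    # ('FixedSizeBinary', ('20',))
print(parse_arrow_type('Timestamp(Nanosecond)'))  # ('Timestamp', ('Nanosecond',))
print(parse_arrow_type('Decimal128(38,0)'))       # ('Decimal128', ('38', '0'))
```

Parameters are returned as strings; a caller that needs the byte width of a `FixedSizeBinary` column would convert the first parameter with `int(...)`.
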
## Complete Example

```python
from amp.registry import RegistryClient
from amp import Client

# Step 1: Discover datasets
registry = RegistryClient()
results = registry.datasets.search('ethereum blocks')

print("Available datasets:")
for ds in results.datasets[:5]:
    print(f"{ds.namespace}/{ds.name}")

# Step 2: Inspect a dataset
print("\nInspecting dataset structure:")
registry.datasets.inspect('graphops', 'ethereum-mainnet')

# Step 3: Get schema programmatically
schema = registry.datasets.describe('graphops', 'ethereum-mainnet')

# Step 4: Query based on discovered schema
client = Client(query_url='grpc://your-server:1602', auth=True)

# Find tables with block_num column
tables_with_blocks = [
    table for table, cols in schema.items()
    if any(col['name'] == 'block_num' for col in cols)
]

for table in tables_with_blocks:
    print(f"\nQuerying {table}...")
    # Use a separate name so the search results from Step 1 are not overwritten
    arrow_table = client.sql(f"SELECT * FROM {table} LIMIT 5").to_arrow()
    print(f"  Rows: {len(arrow_table)}")
```

## Tips

1. **Use `inspect()` interactively**: Great for Jupyter notebooks or REPL exploration
2. **Use `describe()` in scripts**: When you need programmatic access to schema info
3. **Check nullability**: The `nullable` field tells you if a column can have NULL values
4. **Version pinning**: Always specify a version in production (`version='1.2.3'`) instead of using `'latest'`
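
One way to see why pinning matters is to compare the `describe()` result for a pinned version against the one for `'latest'`. The sketch below does this on two sample dictionaries. `diff_schemas` is a hypothetical helper, and both schemas (including the `extra_col` column) are made-up data used only to show the shape of the comparison.

```python
def diff_schemas(old, new):
    """Hypothetical helper: list column-level changes between two describe() results."""
    changes = []
    for table in sorted(set(old) | set(new)):
        old_cols = {col['name']: col for col in old.get(table, [])}
        new_cols = {col['name']: col for col in new.get(table, [])}
        for name in sorted(set(old_cols) | set(new_cols)):
            if name not in new_cols:
                changes.append(f"{table}.{name}: removed")
            elif name not in old_cols:
                changes.append(f"{table}.{name}: added")
            elif old_cols[name] != new_cols[name]:
                changes.append(f"{table}.{name}: changed")
    return changes


# Made-up sample data. In real use these would come from calls such as
# describe(namespace, name, version='1.2.3') and describe(namespace, name).
pinned = {
    'blocks': [
        {'name': 'block_num', 'type': 'UInt64', 'nullable': False},
        {'name': 'miner', 'type': 'FixedSizeBinary(20)', 'nullable': False},
    ],
}
latest = {
    'blocks': [
        {'name': 'block_num', 'type': 'UInt64', 'nullable': False},
        {'name': 'miner', 'type': 'FixedSizeBinary(20)', 'nullable': True},
        {'name': 'extra_col', 'type': 'Int32', 'nullable': True},
    ],
}

for change in diff_schemas(pinned, latest):
    print(change)
# blocks.extra_col: added
# blocks.miner: changed
```

An empty result means the two versions expose the same tables, columns, types, and nullability.
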
