Commit 66e0a7e

Support dataset manifest inspection (#21)
* feat(utils): Add shared manifest inspection utilities
  - describe_manifest(): Extract structured schema from manifests
  - format_arrow_type(): Format Arrow types into readable strings
  - print_schema(): Pretty-print schema in human-readable format

* feat(registry): Add inspect() and describe() methods to datasets client
  - describe(namespace, name, version): Returns structured schema dictionary mapping table names to column info (name, type, nullable)
  - inspect(namespace, name, version): Pretty-prints dataset structure in human-readable format for interactive exploration
  - Both methods use the shared manifest inspection utilities for consistency

* feat(admin): Add inspect() and describe() methods to datasets client
  - describe(namespace, name, revision): Returns structured schema dictionary
  - inspect(namespace, name, revision): Pretty-prints dataset structure

* docs: Add comprehensive guide for dataset inspection
  - Full guide covering both Registry and Admin clients
  - Use cases: interactive exploration, finding specific columns, type inspection, checking nullability, building dynamic queries
  - Practical examples for finding Ethereum addresses and hashes
  - Complete API reference with expected output
1 parent 651bd29 commit 66e0a7e

6 files changed: +606 -1 lines changed

docs/inspecting_datasets.md

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
# Inspecting Dataset Schemas

The Registry client and the Admin API of the standard client (`Client`) both provide methods for exploring dataset structure without manually parsing manifests.

## Quick Start

```python
from amp.registry import RegistryClient

# The same inspection methods are also available on the Admin API of the regular client (Client())
client = RegistryClient()

# Pretty-print dataset structure
client.datasets.inspect('edgeandnode', 'ethereum-mainnet')

# Get structured schema data
schema = client.datasets.describe('edgeandnode', 'ethereum-mainnet')
```

## Methods

### `inspect(namespace, name, version='latest')`

Pretty-prints the dataset structure in a human-readable format, which suits interactive exploration. On the Admin client the third parameter is named `revision` rather than `version`.

**Example Output:**
```
Dataset: edgeandnode/ethereum-mainnet@latest
Description: Ethereum mainnet blockchain data

📊 blocks (21 columns)
  • block_num    UInt64                 NOT NULL
  • timestamp    Timestamp(Nanosecond)  NOT NULL
  • hash         FixedSizeBinary(32)    NOT NULL
  • parent_hash  FixedSizeBinary(32)    NOT NULL
  • miner        FixedSizeBinary(20)    NOT NULL
  ...

📊 transactions (24 columns)
  • block_num    UInt64                 NOT NULL
  • tx_hash      FixedSizeBinary(32)    NOT NULL
  • from         FixedSizeBinary(20)    NOT NULL
  • to           FixedSizeBinary(20)    NULL
  ...
```

### `describe(namespace, name, version='latest')`

Returns a dictionary mapping each table name to a list of column entries, each with `name`, `type`, and `nullable` keys. Use this for programmatic access. On the Admin client the third parameter is named `revision` rather than `version`.

**Returns:**
```python
{
    'blocks': [
        {'name': 'block_num', 'type': 'UInt64', 'nullable': False},
        {'name': 'timestamp', 'type': 'Timestamp(Nanosecond)', 'nullable': False},
        {'name': 'hash', 'type': 'FixedSizeBinary(32)', 'nullable': False},
        ...
    ],
    'transactions': [
        {'name': 'tx_hash', 'type': 'FixedSizeBinary(32)', 'nullable': False},
        ...
    ]
}
```
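
Because the return value is plain nested Python data, it can be reshaped without any client code. The sketch below is illustrative only: `flatten_schema` is a hypothetical helper that is not part of the amp client, and `sample_schema` is hand-written data in the shape shown above rather than real output from `describe()`.

```python
from typing import Dict, List, Tuple, Union

Column = Dict[str, Union[str, bool]]
Schema = Dict[str, List[Column]]

# Hand-written sample in the documented shape; a real schema would come
# from client.datasets.describe(...).
sample_schema: Schema = {
    'blocks': [
        {'name': 'block_num', 'type': 'UInt64', 'nullable': False},
        {'name': 'hash', 'type': 'FixedSizeBinary(32)', 'nullable': False},
    ],
    'transactions': [
        {'name': 'tx_hash', 'type': 'FixedSizeBinary(32)', 'nullable': False},
        {'name': 'to', 'type': 'FixedSizeBinary(20)', 'nullable': True},
    ],
}


def flatten_schema(schema: Schema) -> List[Tuple[str, str, str, bool]]:
    """Turn the nested mapping into (table, column, type, nullable) rows."""
    return [
        (table, col['name'], col['type'], col['nullable'])
        for table, columns in schema.items()
        for col in columns
    ]


rows = flatten_schema(sample_schema)
for table, column, arrow_type, nullable in rows:
    marker = 'NULL' if nullable else 'NOT NULL'
    print(f"{table}.{column}: {arrow_type} {marker}")
```

A flat list of rows like this is convenient to sort, filter, or hand to a tabular library.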
65+
66+
## Use Cases
67+
68+
### 1. Interactive Exploration
69+
70+
```python
71+
# Quickly see what's available
72+
client.datasets.inspect('namespace', 'dataset-name')
73+
```
74+
75+
### 2. Finding Specific Columns
76+
77+
```python
78+
schema = client.datasets.describe('namespace', 'dataset-name')
79+
80+
# Find tables with specific columns
81+
for table_name, columns in schema.items():
82+
col_names = [col['name'] for col in columns]
83+
if 'address' in col_names:
84+
print(f"Table '{table_name}' has an address column")
85+
```
86+
### 3. Finding Ethereum Addresses

```python
schema = client.datasets.describe('namespace', 'dataset-name')

# Find all address columns (20-byte binary fields)
for table_name, columns in schema.items():
    address_cols = [col['name'] for col in columns if col['type'] == 'FixedSizeBinary(20)']
    if address_cols:
        print(f"{table_name}: {', '.join(address_cols)}")

# Example output:
# blocks: miner
# transactions: from, to
# logs: address
```

### 4. Finding Transaction/Block Hashes

```python
schema = client.datasets.describe('namespace', 'dataset-name')

# Find all hash columns (32-byte binary fields)
for table_name, columns in schema.items():
    hash_cols = [col['name'] for col in columns if col['type'] == 'FixedSizeBinary(32)']
    if hash_cols:
        print(f"{table_name}: {', '.join(hash_cols)}")

# Example output:
# blocks: hash, parent_hash, state_root, transactions_root
# transactions: block_hash, tx_hash
# logs: block_hash, tx_hash, topic0, topic1, topic2, topic3
```

### 5. Checking Nullable Columns

```python
schema = client.datasets.describe('namespace', 'dataset-name')

# Find columns that allow NULL values (important for data quality)
for table_name, columns in schema.items():
    nullable_cols = [col['name'] for col in columns if col['nullable']]
    print(f"{table_name}: {len(nullable_cols)}/{len(columns)} nullable columns")
    print(f" Nullable: {', '.join(nullable_cols[:5])}")

# Example output:
# transactions: 5/24 nullable columns
# Nullable: to, gas_price, value, max_fee_per_gas, max_priority_fee_per_gas
```

### 6. Building Dynamic Queries

```python
from amp import Client
from amp.registry import RegistryClient

registry_client = RegistryClient()
client = Client(
    query_url='grpc://localhost:1602',
    admin_url='http://localhost:8080',
    auth=True
)

# Discover available tables
schema = registry_client.datasets.describe('namespace', 'dataset-name')
print(f"Available tables: {list(schema.keys())}")

# Build query based on available columns
if 'blocks' in schema:
    block_cols = [col['name'] for col in schema['blocks']]
    if 'block_num' in block_cols and 'timestamp' in block_cols:
        # Safe to query these columns
        result = client.sql("SELECT block_num, timestamp FROM blocks LIMIT 10")
```
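
The column check above can be folded into a small helper that refuses to build a query for names the schema does not contain. This is a sketch under assumptions: `build_select` is a hypothetical function, not part of the amp client, and the double-quoted identifiers assume the query server accepts standard SQL identifier quoting, which has not been verified here. The sample schema is hand-written.

```python
from typing import Dict, List, Sequence, Union

Schema = Dict[str, List[Dict[str, Union[str, bool]]]]


def build_select(schema: Schema, table: str, columns: Sequence[str], limit: int = 10) -> str:
    """Build a SELECT statement, rejecting tables or columns the schema lacks."""
    if table not in schema:
        raise KeyError(f"Unknown table: {table}")
    available = {col['name'] for col in schema[table]}
    missing = [name for name in columns if name not in available]
    if missing:
        raise KeyError(f"Columns not in {table}: {', '.join(missing)}")
    # Assumption: double-quoted identifiers are accepted. Quoting matters
    # for column names that collide with SQL keywords, such as from and to.
    quoted = ', '.join(f'"{name}"' for name in columns)
    return f'SELECT {quoted} FROM "{table}" LIMIT {int(limit)}'


# Hand-written sample in the shape returned by describe()
sample_schema: Schema = {
    'transactions': [
        {'name': 'tx_hash', 'type': 'FixedSizeBinary(32)', 'nullable': False},
        {'name': 'from', 'type': 'FixedSizeBinary(20)', 'nullable': False},
        {'name': 'to', 'type': 'FixedSizeBinary(20)', 'nullable': True},
    ],
}

query = build_select(sample_schema, 'transactions', ['from', 'to'], limit=5)
print(query)
```

Only names that already appear in the schema reach the query string, so the helper also guards against typos in table or column names.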

## Supported Arrow Types

The `describe()` and `inspect()` methods handle these Arrow types:

- **Primitives**: `UInt64`, `Int32`, `Boolean`, `Binary`
- **Timestamps**: `Timestamp(Nanosecond)`, `Timestamp(Microsecond)`, etc.
- **Fixed-size Binary**: `FixedSizeBinary(20)` (addresses), `FixedSizeBinary(32)` (hashes)
- **Decimals**: `Decimal128(38,0)` (large integers), `Decimal128(18,6)` (fixed-point)
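
Scripts that need more than exact string matches can split these type strings into a base name and its parameters. The following is a minimal sketch, assuming the strings always follow the `Name` or `Name(arg, ...)` form listed above; `parse_type` and `byte_width` are hypothetical helpers and are not part of the amp client.

```python
import re
from typing import Optional, Tuple

# Matches "Name" or "Name(arg1,arg2)" as in the type strings listed above
_TYPE_PATTERN = re.compile(r'^(?P<name>\w+)(?:\((?P<args>[^)]*)\))?$')


def parse_type(type_string: str) -> Tuple[str, Tuple[str, ...]]:
    """Split a type string into (base name, tuple of parameter strings)."""
    match = _TYPE_PATTERN.match(type_string.strip())
    if match is None:
        raise ValueError(f"Unrecognized type string: {type_string!r}")
    args = match.group('args')
    params = tuple(part.strip() for part in args.split(',')) if args else ()
    return match.group('name'), params


def byte_width(type_string: str) -> Optional[int]:
    """Return the width of a FixedSizeBinary type, or None for other types."""
    name, params = parse_type(type_string)
    if name == 'FixedSizeBinary' and len(params) == 1:
        return int(params[0])
    return None


for example in ['UInt64', 'Timestamp(Nanosecond)', 'FixedSizeBinary(20)', 'Decimal128(38,0)']:
    print(example, '->', parse_type(example), byte_width(example))
```

With a helper like this, a filter such as `byte_width(col['type']) == 20` expresses "address-sized column" without repeating the full type string.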

## Complete Example

```python
from amp.registry import RegistryClient
from amp import Client

# Step 1: Discover datasets
registry = RegistryClient()
results = registry.datasets.search('ethereum blocks')

print("Available datasets:")
for ds in results.datasets[:5]:
    print(f"{ds.namespace}/{ds.name}")

# Step 2: Inspect a dataset
print("\nInspecting dataset structure:")
registry.datasets.inspect('graphops', 'ethereum-mainnet')

# Step 3: Get schema programmatically
schema = registry.datasets.describe('graphops', 'ethereum-mainnet')

# Step 4: Query based on discovered schema
client = Client(query_url='grpc://your-server:1602', auth=True)

# Find tables with block_num column
tables_with_blocks = [
    table for table, cols in schema.items()
    if any(col['name'] == 'block_num' for col in cols)
]

for table in tables_with_blocks:
    print(f"\nQuerying {table}...")
    # Use a separate name so the search results from Step 1 are not overwritten
    rows = client.sql(f"SELECT * FROM {table} LIMIT 5").to_arrow()
    print(f" Rows: {len(rows)}")
```

## Tips

1. **Use `inspect()` interactively**: Great for Jupyter notebooks or REPL exploration
2. **Use `describe()` in scripts**: When you need programmatic access to schema info
3. **Check nullability**: The `nullable` field tells you if a column can have NULL values
4. **Version pinning**: Always specify a version in production (`version='1.2.3'`) instead of using `'latest'`
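
One way to see why pinning matters is to compare the schemas of two versions before upgrading. The sketch below is an illustration, not a feature of the amp client: `diff_schemas` is a hypothetical helper, and the two dictionaries are hand-written stand-ins for what `describe()` might return for two pinned versions.

```python
from typing import Dict, List, Union

Schema = Dict[str, List[Dict[str, Union[str, bool]]]]


def diff_schemas(old: Schema, new: Schema) -> List[str]:
    """List added, removed, and changed tables and columns between two schemas."""
    changes: List[str] = []
    for table in sorted(set(old) | set(new)):
        if table not in new:
            changes.append(f"table removed: {table}")
            continue
        if table not in old:
            changes.append(f"table added: {table}")
            continue
        old_cols = {col['name']: col for col in old[table]}
        new_cols = {col['name']: col for col in new[table]}
        for name in sorted(set(old_cols) | set(new_cols)):
            if name not in new_cols:
                changes.append(f"column removed: {table}.{name}")
            elif name not in old_cols:
                changes.append(f"column added: {table}.{name}")
            elif old_cols[name] != new_cols[name]:
                changes.append(f"column changed: {table}.{name}")
    return changes


# Hand-written stand-ins for describe() output at two pinned versions
old_schema: Schema = {
    'blocks': [
        {'name': 'block_num', 'type': 'UInt64', 'nullable': False},
        {'name': 'miner', 'type': 'FixedSizeBinary(20)', 'nullable': False},
    ],
}
new_schema: Schema = {
    'blocks': [
        {'name': 'block_num', 'type': 'UInt64', 'nullable': False},
        {'name': 'miner', 'type': 'FixedSizeBinary(20)', 'nullable': True},
    ],
    'logs': [
        {'name': 'address', 'type': 'FixedSizeBinary(20)', 'nullable': False},
    ],
}

changes = diff_schemas(old_schema, new_schema)
for change in changes:
    print(change)
```

An empty list means the two versions expose the same tables, column names, types, and nullability.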

src/amp/admin/datasets.py

Lines changed: 74 additions & 1 deletion
@@ -4,7 +4,9 @@
 including registration, deployment, versioning, and manifest operations.
 """
 
-from typing import TYPE_CHECKING, Optional
+from typing import TYPE_CHECKING, Dict, Optional
+
+from amp.utils.manifest_inspector import describe_manifest, print_schema
 
 from . import models

@@ -198,6 +200,77 @@ def get_manifest(self, namespace: str, name: str, revision: str) -> dict:
         response = self._admin._request('GET', path)
         return response.json()
 
+    def describe(self, namespace: str, name: str, revision: str = 'latest') -> Dict[str, list[Dict[str, str | bool]]]:
+        """Get a structured summary of tables and columns in a dataset.
+
+        Returns a dictionary mapping table names to lists of column information,
+        making it easy to programmatically inspect the dataset schema.
+
+        Args:
+            namespace: Dataset namespace
+            name: Dataset name
+            revision: Version tag (default: 'latest')
+
+        Returns:
+            dict: Mapping of table names to column information. Each column is a dict with:
+                - name: Column name (str)
+                - type: Arrow type (str, simplified representation)
+                - nullable: Whether the column allows NULL values (bool)
+
+        Example:
+            >>> client = AdminClient('http://localhost:8080')
+            >>> schema = client.datasets.describe('_', 'eth_firehose', 'latest')
+            >>> for table_name, columns in schema.items():
+            ...     print(f"\\nTable: {table_name}")
+            ...     for col in columns:
+            ...         nullable = "NULL" if col['nullable'] else "NOT NULL"
+            ...         print(f" {col['name']}: {col['type']} {nullable}")
+        """
+        manifest = self.get_manifest(namespace, name, revision)
+        return describe_manifest(manifest)
+
+    def inspect(self, namespace: str, name: str, revision: str = 'latest') -> None:
+        """Pretty-print the structure of a dataset for easy inspection.
+
+        Displays tables and their columns in a human-readable format.
+        This is perfect for exploring datasets interactively.
+
+        Args:
+            namespace: Dataset namespace
+            name: Dataset name
+            revision: Version tag (default: 'latest')
+
+        Example:
+            >>> client = AdminClient('http://localhost:8080')
+            >>> client.datasets.inspect('_', 'eth_firehose')
+            Dataset: _/eth_firehose@latest
+
+            blocks (21 columns)
+              block_num UInt64 NOT NULL
+              timestamp Timestamp NOT NULL
+              hash FixedSizeBinary(32) NOT NULL
+              ...
+
+            transactions (24 columns)
+              tx_hash FixedSizeBinary(32) NOT NULL
+              from FixedSizeBinary(20) NOT NULL
+              to FixedSizeBinary(20) NULL
+              ...
+        """
+        header = f'Dataset: {namespace}/{name}@{revision}'
+
+        # Try to get version info for additional context (optional, might not always work)
+        try:
+            version_info = self.get_version(namespace, name, revision)
+            if hasattr(version_info, 'kind'):
+                header += f'\nKind: {version_info.kind}'
+        except Exception:
+            # If we can't get version info, that's okay - just continue
+            pass
+
+        schema = self.describe(namespace, name, revision)
+        print_schema(schema, header=header)
+
     def delete(self, namespace: str, name: str) -> None:
         """Delete all versions and metadata for a dataset.

src/amp/registry/datasets.py

Lines changed: 68 additions & 0 deletions
@@ -5,6 +5,8 @@
 import logging
 from typing import TYPE_CHECKING, Any, Dict, Optional
 
+from amp.utils.manifest_inspector import describe_manifest, print_schema
+
 from . import models
 
 if TYPE_CHECKING:
@@ -197,6 +199,72 @@ def get_manifest(self, namespace: str, name: str, version: str) -> dict:
         response = self._registry._request('GET', path)
         return response.json()
 
+    def describe(self, namespace: str, name: str, version: str = 'latest') -> Dict[str, list[Dict[str, str | bool]]]:
+        """Get a structured summary of tables and columns in a dataset.
+
+        Returns a dictionary mapping table names to lists of column information,
+        making it easy to programmatically inspect the dataset schema.
+
+        Args:
+            namespace: Dataset namespace
+            name: Dataset name
+            version: Version tag (default: 'latest')
+
+        Returns:
+            dict: Mapping of table names to column information. Each column is a dict with:
+                - name: Column name (str)
+                - type: Arrow type (str, simplified representation)
+                - nullable: Whether the column allows NULL values (bool)
+
+        Example:
+            >>> client = RegistryClient()
+            >>> schema = client.datasets.describe('edgeandnode', 'ethereum-mainnet', 'latest')
+            >>> for table_name, columns in schema.items():
+            ...     print(f"\\nTable: {table_name}")
+            ...     for col in columns:
+            ...         nullable = "NULL" if col['nullable'] else "NOT NULL"
+            ...         print(f" {col['name']}: {col['type']} {nullable}")
+        """
+        manifest = self.get_manifest(namespace, name, version)
+        return describe_manifest(manifest)
+
+    def inspect(self, namespace: str, name: str, version: str = 'latest') -> None:
+        """Pretty-print the structure of a dataset for easy inspection.
+
+        Displays tables and their columns in a human-readable format.
+        This is perfect for exploring datasets interactively.
+
+        Args:
+            namespace: Dataset namespace
+            name: Dataset name
+            version: Version tag (default: 'latest')
+
+        Example:
+            >>> client = RegistryClient()
+            >>> client.datasets.inspect('graphops', 'ethereum-mainnet')
+            Dataset: graphops/ethereum-mainnet@latest
+
+            blocks (4 columns)
+              block_num UInt64 NOT NULL
+              timestamp Timestamp NOT NULL
+              hash FixedSizeBinary(32) NOT NULL
+              parent_hash FixedSizeBinary(32) NOT NULL
+
+            transactions (23 columns)
+              block_num UInt64 NOT NULL
+              tx_hash FixedSizeBinary(32) NOT NULL
+              ...
+        """
+        # Get dataset info
+        dataset = self.get(namespace, name)
+        header = f'Dataset: {namespace}/{name}@{version}'
+        if dataset.description:
+            header += f'\nDescription: {dataset.description}'
+
+        # Get schema and print
+        schema = self.describe(namespace, name, version)
+        print_schema(schema, header=header)
+
     # Write Operations (Require Authentication)
 
     def publish(

src/amp/utils/__init__.py

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+"""Utility modules for amp Python client."""
+
+from .manifest_inspector import describe_manifest, format_arrow_type, print_schema
+
+__all__ = ['describe_manifest', 'format_arrow_type', 'print_schema']
