Commit d372786

doc(data): update README.md (#121)
Explain that `./data` does not contain static data anymore and is now basically a workspace with scripts that we've not properly integrated into IQB yet. While there, also update the related READMEs to mention recent changes that caused us to stop using static data in `./data` and which ultimately changed the semantics of `./data`.
1 parent cb37765 commit d372786

File tree

3 files changed: +29 −8 lines

README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -59,7 +59,7 @@ See [analysis/README.md](analysis/README.md) for more information.
 
 ### **`data/`**
 
-Sample datasets used in the IQB app prototype and notebooks.
+Workspace for data curation scripts, release manifests, and local cache artifacts.
 
 See [data/README.md](data/README.md) for details.
 
```

analysis/README.md

Lines changed: 6 additions & 0 deletions

```diff
@@ -29,6 +29,12 @@ Template demonstrating basic IQB usage:
 
 Use this as a starting point for custom analysis.
 
+## Local Cache for Notebook Tests
+
+The test suite seeds `analysis/.iqb` with a small cache snapshot to avoid
+network downloads when executing notebooks in CI. If you need to refresh
+or replace the cached month, update `analysis/.iqb` accordingly.
+
 ## Running Notebooks
 
 ### In VSCode
```
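The seeding step described in the analysis/README.md changes can be sketched as follows. This is an illustration, not the actual test-suite code; the helper name and fixture source are assumptions.

```python
# Illustrative sketch (not the actual test-suite code): copy a cache snapshot
# into analysis/.iqb so notebooks can execute without network access.
import shutil
from pathlib import Path


def seed_notebook_cache(fixture_dir: Path, cache_dir: Path = Path("analysis/.iqb")) -> None:
    """Copy a fixture cache snapshot into the notebook cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok lets repeated seeding refresh an existing cache in place.
    shutil.copytree(fixture_dir, cache_dir, dirs_exist_ok=True)
```

Refreshing the cached month would then amount to replacing the fixture snapshot and re-running the seeding step.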

data/README.md

Lines changed: 22 additions & 7 deletions

```diff
@@ -1,8 +1,25 @@
-# IQB Static Data Files
+# IQB Data Workspace
 
-This directory contains static reference data used by the IQB prototype.
+This directory is a workspace for data curation scripts, release manifests,
+and local cache artifacts produced during generation.
 
-## Data Format
+## What Lives Here
+
+- `generate_data.py`: Orchestrates BigQuery extraction and writes cache files
+  under `./cache/v1/` for local use.
+- `run_query.py`: Legacy single-query helper (kept for now, but not the
+  preferred workflow).
+- `ghcache.py`: Helper for publishing cache files to GitHub releases.
+- `state/ghremote/manifest.json`: Release manifest used by the GitHub remote
+  cache implementation.
+
+Static cache fixtures used by tests and notebooks are stored elsewhere:
+
+- Real data fixtures: `library/tests/fixtures/real-data`
+- Fake data fixtures: `library/tests/fixtures/fake-data`
+- Notebook cache: `analysis/.iqb` (seeded to avoid network downloads in tests)
+
+## Cache Format
 
 Raw query results stored efficiently for flexible analysis:
 
```
```diff
@@ -14,8 +31,6 @@ Raw query results stored efficiently for flexible analysis:
 
 ## GitHub Cache Synchronization (Interim Solution)
 
-**IMPORTANT**: This is a throwaway interim solution that will be replaced by GCS.
-
 Since the v1 Parquet files can be large (~1-60 MiB) and we have BigQuery quota
 constraints, we use GitHub releases to distribute pre-generated cache files.
 
```
````diff
@@ -26,7 +41,7 @@ The GitHub release manifest lives at `state/ghremote/manifest.json`.
 
 ### For Maintainers (Publishing New Cache)
 
-When you generate new cache files locally:
+When you generate new cache files locally (under `./cache/v1`):
 
 ```bash
 uv run ./data/ghcache.py scan
````
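A minimal sketch of what scanning the local cache tree could look like. This is an illustration of the idea, not `ghcache.py` itself; the function name and default path are assumptions.

```python
# Illustrative sketch, not ghcache.py itself: find the Parquet cache files
# under the local cache tree that a publish step could upload to a release.
from pathlib import Path


def scan_cache(root: Path = Path("data/cache/v1")) -> list[Path]:
    """Return every data.parquet file below the cache root, sorted by path."""
    return sorted(root.rglob("data.parquet"))
```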
```diff
@@ -79,7 +94,7 @@ uv run python run_query.py --granularity country \
   --start-date 2024-10-01 --end-date 2024-11-01
 
 # Inspect results with pandas
-python3 << 'EOF'
+uv run python << 'EOF'
 import pandas as pd
 df = pd.read_parquet('cache/v1/2024-10-01/2024-11-01/downloads_by_country/data.parquet')
 print(df.head())
```
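The Parquet path in the snippet above appears to follow a `cache/v1/<start-date>/<end-date>/<table>/data.parquet` convention. A small helper (hypothetical, for illustration only) makes that layout explicit:

```python
# Hypothetical helper illustrating the cache path convention seen above:
# cache/v1/<start-date>/<end-date>/<table>/data.parquet
from pathlib import Path


def cache_path(start: str, end: str, table: str) -> Path:
    """Build the cache file path for a date range and table name."""
    return Path("cache/v1") / start / end / table / "data.parquet"


print(cache_path("2024-10-01", "2024-11-01", "downloads_by_country"))
```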
