Commit d372786

doc(data): update README.md (#121)
Explain that `./data` does not contain static data anymore and is now basically a workspace with scripts that we've not properly integrated into IQB yet. While there, also update the related READMEs to mention recent changes that caused us to stop using static data in `./data` and which ultimately changed the semantics of `./data`.
1 parent cb37765 commit d372786

File tree

3 files changed: +29 −8 lines

README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -59,7 +59,7 @@ See [analysis/README.md](analysis/README.md) for more information.
 
 ### **`data/`**
 
-Sample datasets used in the IQB app prototype and notebooks.
+Workspace for data curation scripts, release manifests, and local cache artifacts.
 
 See [data/README.md](data/README.md) for details.
 
```

analysis/README.md

Lines changed: 6 additions & 0 deletions

```diff
@@ -29,6 +29,12 @@ Template demonstrating basic IQB usage:
 
 Use this as a starting point for custom analysis.
 
+## Local Cache for Notebook Tests
+
+The test suite seeds `analysis/.iqb` with a small cache snapshot to avoid
+network downloads when executing notebooks in CI. If you need to refresh
+or replace the cached month, update `analysis/.iqb` accordingly.
+
 ## Running Notebooks
 
 ### In VSCode
```
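The seeding step described in the analysis/README.md changes can be sketched as follows. This is an illustration, not the actual test-suite code; the helper name and fixture source are assumptions.

```python
# Illustrative sketch (not the actual test-suite code): copy a cache snapshot
# into analysis/.iqb so notebooks can execute without network access.
import shutil
from pathlib import Path


def seed_notebook_cache(fixture_dir: Path, cache_dir: Path = Path("analysis/.iqb")) -> None:
    """Copy a fixture cache snapshot into the notebook cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok lets repeated seeding refresh an existing cache in place.
    shutil.copytree(fixture_dir, cache_dir, dirs_exist_ok=True)
```

Refreshing the cached month would then amount to replacing the fixture snapshot and re-running the seeding step.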

data/README.md

Lines changed: 22 additions & 7 deletions

```diff
@@ -1,8 +1,25 @@
-# IQB Static Data Files
+# IQB Data Workspace
 
-This directory contains static reference data used by the IQB prototype.
+This directory is a workspace for data curation scripts, release manifests,
+and local cache artifacts produced during generation.
 
-## Data Format
+## What Lives Here
+
+- `generate_data.py`: Orchestrates BigQuery extraction and writes cache files
+  under `./cache/v1/` for local use.
+- `run_query.py`: Legacy single-query helper (kept for now, but not the
+  preferred workflow).
+- `ghcache.py`: Helper for publishing cache files to GitHub releases.
+- `state/ghremote/manifest.json`: Release manifest used by the GitHub remote
+  cache implementation.
+
+Static cache fixtures used by tests and notebooks are stored elsewhere:
+
+- Real data fixtures: `library/tests/fixtures/real-data`
+- Fake data fixtures: `library/tests/fixtures/fake-data`
+- Notebook cache: `analysis/.iqb` (seeded to avoid network downloads in tests)
+
+## Cache Format
 
 Raw query results stored efficiently for flexible analysis:
 
```
```diff
@@ -14,8 +31,6 @@ Raw query results stored efficiently for flexible analysis:
 
 ## GitHub Cache Synchronization (Interim Solution)
 
-**IMPORTANT**: This is a throwaway interim solution that will be replaced by GCS.
-
 Since the v1 Parquet files can be large (~1-60 MiB) and we have BigQuery quota
 constraints, we use GitHub releases to distribute pre-generated cache files.
 
```
````diff
@@ -26,7 +41,7 @@ The GitHub release manifest lives at `state/ghremote/manifest.json`.
 
 ### For Maintainers (Publishing New Cache)
 
-When you generate new cache files locally:
+When you generate new cache files locally (under `./cache/v1`):
 
 ```bash
 uv run ./data/ghcache.py scan
````
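A minimal sketch of what scanning the local cache tree could look like. This is an illustration of the idea, not `ghcache.py` itself; the function name and default path are assumptions.

```python
# Illustrative sketch, not ghcache.py itself: find the Parquet cache files
# under the local cache tree that a publish step could upload to a release.
from pathlib import Path


def scan_cache(root: Path = Path("data/cache/v1")) -> list[Path]:
    """Return every data.parquet file below the cache root, sorted by path."""
    return sorted(root.rglob("data.parquet"))
```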
```diff
@@ -79,7 +94,7 @@ uv run python run_query.py --granularity country \
   --start-date 2024-10-01 --end-date 2024-11-01
 
 # Inspect results with pandas
-python3 << 'EOF'
+uv run python << 'EOF'
 import pandas as pd
 df = pd.read_parquet('cache/v1/2024-10-01/2024-11-01/downloads_by_country/data.parquet')
 print(df.head())
```
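The Parquet path in the snippet above appears to follow a `cache/v1/<start-date>/<end-date>/<table>/data.parquet` convention. A small helper (hypothetical, for illustration only) makes that layout explicit:

```python
# Hypothetical helper illustrating the cache path convention seen above:
# cache/v1/<start-date>/<end-date>/<table>/data.parquet
from pathlib import Path


def cache_path(start: str, end: str, table: str) -> Path:
    """Build the cache file path for a date range and table name."""
    return Path("cache/v1") / start / end / table / "data.parquet"


print(cache_path("2024-10-01", "2024-11-01", "downloads_by_country"))
```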
