Commit 4d8f392

doc(data/README.md): sync with recent changes (#39)
This diff updates `data/README.md` to sync its content with recent changes regarding how we store cached queries.
1 parent 1aa2c13 commit 4d8f392

1 file changed: +50 additions, -37 deletions


data/README.md

@@ -1,44 +1,52 @@
 # IQB Static Data Files
 
-This directory contains static measurement data used by
-the IQB prototype for Phase 1 development.
+This directory contains static reference data used by the IQB prototype.
 
 ## Current Dataset
 
-**Period**: October 2024 (2024-10-01 to 2024-10-31) and October 2025
+**Period**: October 2024 and October 2025
 
 **Source**: [M-Lab NDT](https://www.measurementlab.net/tests/ndt/) unified views
 
-**Countries**: all available countries
+**Countries**: All available countries
 
-### Files
+## Data Formats
 
-Generated files live inside [./cache/v0](./cache/v0).
+We maintain two data formats in `./cache/`:
 
-Here are some sample files:
+### v0 - JSON Format
 
-- `us_2024_10.json` - United States, ~31M download samples, ~24M upload samples
+Per-country JSON files with pre-aggregated percentiles:
 
-- `de_2024_10.json` - Germany, ~7M download samples, ~4M upload samples
+- **Location**: `./cache/v0/{country}_{year}_{month}.json`
+- **Example**: `us_2024_10.json` (~31M download samples, ~24M upload samples)
+- **Structure**: Simple JSON with percentiles (p1, p5, p10, p25, p50, p75, p90, p95, p99)
+- **Use case**: Casual data processing, backward compatibility, quick inspection
 
-- `br_2024_10.json` - Brazil, ~5M download samples, ~3M upload samples
-
-### Data Structure
-
-Each JSON file contains:
-
-```JavaScript
+```json
 {
   "metrics": {
-    "download_throughput_mbps": {"p1": 0.38, /* ... */, "p99": 891.82},
-    "upload_throughput_mbps": {"p1": 0.06, /* ... */, "p99": 813.73},
-    "latency_ms": {"p1": 0.16, /* ... */, "p99": 254.34},
-    "packet_loss": {"p1": 0.0, /* ... */, "p99": 0.25}
+    "download_throughput_mbps": {"p1": 0.38, "p99": 891.82},
+    "upload_throughput_mbps": {"p1": 0.06, "p99": 813.73},
+    "latency_ms": {"p1": 0.16, "p99": 254.34},
+    "packet_loss": {"p1": 0.0, "p99": 0.25}
   }
 }
 ```
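For quick inspection, the v0 structure above can be read with the standard library alone; a minimal sketch using the sample values shown in this diff:

```python
import json

# Abbreviated v0 payload as shown in the README diff (full files carry
# p1, p5, p10, p25, p50, p75, p90, p95, p99 for each metric).
sample = """
{
  "metrics": {
    "download_throughput_mbps": {"p1": 0.38, "p99": 891.82},
    "upload_throughput_mbps": {"p1": 0.06, "p99": 813.73},
    "latency_ms": {"p1": 0.16, "p99": 254.34},
    "packet_loss": {"p1": 0.0, "p99": 0.25}
  }
}
"""

data = json.loads(sample)
download = data["metrics"]["download_throughput_mbps"]
print(download["p99"])  # -> 891.82
```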
 
-**Percentiles included**: p1, p5, p10, p25, p50, p75, p90, p95, p99
+### v1 - Parquet Format (Current)
+
+Raw query results stored efficiently for flexible analysis:
+
+- **Location**: `./cache/v1/{start_date}/{end_date}/{query_type}/`
+- **Files**:
+  - `data.parquet` - Query results (~1-60 MiB, streamable, chunked row groups)
+  - `stats.json` - Query metadata (start time, duration, bytes processed/billed, template hash)
+- **Use case**: Efficient filtering, large-scale analysis, direct PyArrow/Pandas processing
+
+**Migration**: We're transitioning to v1 as the primary format. v0 remains available for
+backward compatibility and casual use. If Parquet proves too heavy for some workflows,
+v0 will continue to be maintained.
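The commit doesn't show the exact `stats.json` schema, only what it records (start time, duration, bytes processed/billed, template hash), so the field names below are illustrative assumptions; the directory layout follows the `./cache/v1/{start_date}/{end_date}/{query_type}/` convention above:

```python
import json
import tempfile
from pathlib import Path

# Illustrative v1 cache entry following the documented layout.
root = Path(tempfile.mkdtemp())
entry = root / "cache" / "v1" / "2024-10-01" / "2024-11-01" / "downloads_by_country"
entry.mkdir(parents=True)

# Hypothetical stats.json fields; the README only lists what is recorded,
# not the key names.
stats = {
    "start_time": "2024-11-02T10:15:00Z",
    "duration_s": 42.7,
    "bytes_processed": 1_234_567_890,
    "bytes_billed": 1_342_177_280,
    "template_hash": "deadbeef",
}
(entry / "stats.json").write_text(json.dumps(stats))

# Cost tracking: reload the metadata and report billed bytes.
loaded = json.loads((entry / "stats.json").read_text())
print(f"billed: {loaded['bytes_billed'] / 2**30:.2f} GiB")

# The sibling data.parquet would be read with PyArrow or Pandas, e.g.
# pyarrow.parquet.read_table(entry / "data.parquet").
```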
 
 ## How This Data Was Generated
 
@@ -75,19 +83,23 @@ This orchestrates the complete pipeline:
 
 3. Merges the data into per-country JSON files
 
-Generated files `${country}_2024_10.json` and `${country}_2025_10.json`
-inside the [./cache/v0](./cache/v0) directory.
+Generated files: v0 JSON files `${country}_2024_10.json` and `${country}_2025_10.json`
+inside [./cache/v0](./cache/v0), plus v1 Parquet cache with query metadata.
 
 **Individual Pipeline Stages** (for debugging):
 
 ```bash
 cd data/
 
 # Stage 1a: Query downloads
-uv run python run_query.py query_downloads.sql -o downloads.json
+uv run python run_query.py downloads_by_country \
+  --start-date 2024-10-01 --end-date 2024-11-01 \
+  -o downloads.json
 
 # Stage 1b: Query uploads
-uv run python run_query.py query_uploads.sql -o uploads.json
+uv run python run_query.py uploads_by_country \
+  --start-date 2024-10-01 --end-date 2024-11-01 \
+  -o uploads.json
 
 # Stage 2: Merge data
 uv run python merge_data.py
@@ -97,25 +109,30 @@ uv run python merge_data.py
 
 - [generate_data.py](generate_data.py) - Orchestrates the complete pipeline
 
-- [run_query.py](run_query.py) - Executes a BigQuery query and saves results
+- [run_query.py](run_query.py) - Executes BigQuery queries using IQBPipeline,
+  saves v1 cache (parquet + stats) and v0 JSON output
 
 - [merge_data.py](merge_data.py) - Merges download and upload data into
-  per-country files
+  per-country v0 files
 
 ## Notes
 
 - **Static data**: These files contain pre-aggregated percentiles
   for Phase 1 prototype. Phase 2 will add dynamic data fetching.
 
+- **Data formats**: v0 JSON files (~1.4KB) for quick analysis;
+  v1 Parquet files (~1-60 MiB) with stats.json for efficient processing and cost tracking.
+
 - **Time granularity**: Data is aggregated over the entire
   months of October 2024 and October 2025. The analyst decides which
   time window to use for running IQB calculations.
 
 - **Percentile selection**: The Streamlit UI allows users
   to select which percentile(s) to use for IQB score calculations.
 
-- **File size**: Each file is ~1.4KB (uncompressed). No
-  compression needed.
+- **File size**: Each per-country JSON file is ~1.4KB (uncompressed). No
+  compression needed. For finer-grained queries, the Parquet files
+  allow more efficient storage and data processing.
 
 ## M-Lab NDT Data Schema
 
@@ -142,12 +159,8 @@ for details.
 
 ## Future Improvements (Phase 2+)
 
-- Dynamic data fetching from BigQuery
-
-- Support for additional datasets (Ookla, Cloudflare)
-
+- Direct Parquet reading in cache.py (PyArrow predicate pushdown for efficient filtering)
+- Additional datasets (Ookla, Cloudflare)
+- Finer geographic resolution (cities, provinces, ASNs)
 - Finer time granularity (daily, weekly)
-
-- Sub-national geographic resolution (cities, ASNs)
-
-- Local database integration for caching aggregated data
+- Remote storage for v1 cache (GitHub releases, GCS buckets)
