# IQB Static Data Files

This directory contains static reference data used by the IQB prototype.

## Current Dataset

**Period**: October 2024 and October 2025

**Source**: [M-Lab NDT](https://www.measurementlab.net/tests/ndt/) unified views

**Countries**: All available countries

## Data Formats

We maintain two data formats in `./cache/`:

### v0 - JSON Format

Per-country JSON files with pre-aggregated percentiles:

- **Location**: `./cache/v0/{country}_{year}_{month}.json`
- **Example**: `us_2024_10.json` (~31M download samples, ~24M upload samples)
- **Structure**: Simple JSON with percentiles (p1, p5, p10, p25, p50, p75, p90, p95, p99)
- **Use case**: Casual data processing, backward compatibility, quick inspection

```json
{
  "metrics": {
    "download_throughput_mbps": {"p1": 0.38, "p99": 891.82},
    "upload_throughput_mbps": {"p1": 0.06, "p99": 813.73},
    "latency_ms": {"p1": 0.16, "p99": 254.34},
    "packet_loss": {"p1": 0.0, "p99": 0.25}
  }
}
```

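For quick inspection, a v0 file needs only the standard library. A minimal sketch, parsing an inline payload that mirrors the schema above (in practice you would `json.load` a file such as `./cache/v0/us_2024_10.json`):

```python
import json

# Inline stand-in for a v0 file such as ./cache/v0/us_2024_10.json
raw = """
{
  "metrics": {
    "download_throughput_mbps": {"p1": 0.38, "p99": 891.82},
    "latency_ms": {"p1": 0.16, "p99": 254.34}
  }
}
"""

data = json.loads(raw)
download = data["metrics"]["download_throughput_mbps"]
print(download["p99"])  # 891.82
```
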
### v1 - Parquet Format (Current)

Raw query results stored efficiently for flexible analysis:

- **Location**: `./cache/v1/{start_date}/{end_date}/{query_type}/`
- **Files**:
  - `data.parquet` - Query results (~1-60 MiB, streamable, chunked row groups)
  - `stats.json` - Query metadata (start time, duration, bytes processed/billed, template hash)
- **Use case**: Efficient filtering, large-scale analysis, direct PyArrow/Pandas processing

**Migration**: We're transitioning to v1 as the primary format. v0 remains available for
backward compatibility and casual use. If Parquet proves too heavy for some workflows,
v0 will continue to be maintained.

## How This Data Was Generated

3. Merges the data into per-country JSON files

Generated files: v0 JSON files `${country}_2024_10.json` and `${country}_2025_10.json`
inside [./cache/v0](./cache/v0), plus v1 Parquet cache with query metadata.

**Individual Pipeline Stages** (for debugging):

```bash
cd data/

# Stage 1a: Query downloads
uv run python run_query.py downloads_by_country \
  --start-date 2024-10-01 --end-date 2024-11-01 \
  -o downloads.json

# Stage 1b: Query uploads
uv run python run_query.py uploads_by_country \
  --start-date 2024-10-01 --end-date 2024-11-01 \
  -o uploads.json

# Stage 2: Merge data
uv run python merge_data.py
```

- [generate_data.py](generate_data.py) - Orchestrates the complete pipeline

- [run_query.py](run_query.py) - Executes BigQuery queries using IQBPipeline,
  saves v1 cache (parquet + stats) and v0 JSON output

- [merge_data.py](merge_data.py) - Merges download and upload data into
  per-country v0 files

## Notes

- **Static data**: These files contain pre-aggregated percentiles
  for the Phase 1 prototype. Phase 2 will add dynamic data fetching.

- **Data formats**: v0 JSON files (~1.4KB) for quick analysis;
  v1 Parquet files (~1-60 MiB) with stats.json for efficient processing and cost tracking.

- **Time granularity**: Data is aggregated over the entire
  months of October 2024 and October 2025. The analyst decides which
  time window to use for running IQB calculations.

- **Percentile selection**: The Streamlit UI allows users
  to select which percentile(s) to use for IQB score calculations.

- **File size**: Each per-country JSON file is ~1.4KB (uncompressed). No
  compression needed. For finer-grained queries, the Parquet files
  allow more efficient storage and data processing.

## M-Lab NDT Data Schema

## Future Improvements (Phase 2+)

- Direct Parquet reading in cache.py (PyArrow predicate pushdown for efficient filtering)
- Additional datasets (Ookla, Cloudflare)
- Finer geographic resolution (cities, provinces, ASNs)
- Finer time granularity (daily, weekly)
- Remote storage for v1 cache (GitHub releases, GCS buckets)