Commit 8b1cfb2

feat: add initial, static data (#15)
This diff adds initial, static data to the repository. Along with the static data, we include the queries and the scripts used to generate the static data.
1 parent c611bd4 commit 8b1cfb2

File tree

10 files changed: +646 -2 lines changed
data/.gitignore

Lines changed: 3 additions & 0 deletions

# Intermediate pipeline outputs
/downloads.json
/uploads.json

data/README.md

Lines changed: 182 additions & 2 deletions (replaces the previous two-line `# Datasets` stub)

# IQB Static Data Files

This directory contains static measurement data used by the IQB prototype for Phase 1 development.

## Current Dataset

**Period**: October 2024 (2024-10-01 to 2024-10-31)

**Source**: [M-Lab NDT](https://www.measurementlab.net/tests/ndt/) unified views

**Countries**: United States (US), Germany (DE), Brazil (BR)

### Files

- `us_2024_10.json` - United States, ~31M download samples, ~24M upload samples
- `de_2024_10.json` - Germany, ~7M download samples, ~4M upload samples
- `br_2024_10.json` - Brazil, ~5M download samples, ~3M upload samples

### Data Structure

Each JSON file contains:

```jsonc
{
  "metadata": {
    "country_code": "US",
    "country_name": "United States",
    "period": "2024-10",
    "period_description": "October 2024",
    "dataset": "M-Lab NDT",
    "download_samples": 31443312,
    "upload_samples": 24288961
  },
  "metrics": {
    "download_throughput_mbps": {"p1": 0.38, /* ... */ "p99": 891.82},
    "upload_throughput_mbps": {"p1": 0.06, /* ... */ "p99": 813.73},
    "latency_ms": {"p1": 0.16, /* ... */ "p99": 254.34},
    "packet_loss": {"p1": 0.0, /* ... */ "p99": 0.25}
  }
}
```

**Percentiles included**: p1, p5, p10, p25, p50, p75, p90, p95, p99
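For orientation, here is how a consumer might read one of these records. The snippet inlines a trimmed BR record (percentile values copied from `br_2024_10.json`) rather than reading from disk, so it is a sketch of the access pattern, not part of the pipeline:

```python
import json

# Trimmed to two percentiles of one metric; the real files carry
# p1..p99 for four metrics. Values copied from data/br_2024_10.json.
record = json.loads("""
{
  "metadata": {"country_code": "BR", "period": "2024-10"},
  "metrics": {
    "download_throughput_mbps": {"p50": 51.9831305263177, "p90": 330.3352983503099}
  }
}
""")

p50 = record["metrics"]["download_throughput_mbps"]["p50"]
print(f"{record['metadata']['country_code']} median download: {p50:.1f} Mbps")
```

In practice the same two-level lookup (`metrics` → metric name → percentile key) works for every file, since all three countries share the schema above.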
## How This Data Was Generated

### BigQuery Queries

The data was extracted from M-Lab's public BigQuery tables using two queries:

1. **Downloads** (`query_downloads.sql`): Queries
   `measurement-lab.ndt.unified_downloads` for:

   - Download throughput (`a.MeanThroughputMbps`)
   - Latency (`a.MinRTT`)
   - Packet loss (`a.LossRate`)

2. **Uploads** (`query_uploads.sql`): Queries
   `measurement-lab.ndt.unified_uploads` for:

   - Upload throughput (`a.MeanThroughputMbps`)
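The SQL files themselves are not reproduced here, but the shape of the aggregation is simple: for each country, compute a fixed set of percentiles over one metric column. A rough local sketch of that aggregation (our own nearest-rank helper, not the actual query logic, which runs server-side in BigQuery):

```python
def percentile(sorted_values: list[float], p: int) -> float:
    """Nearest-rank percentile over a pre-sorted list (illustrative only)."""
    k = round(p / 100 * (len(sorted_values) - 1))
    return sorted_values[max(0, min(len(sorted_values) - 1, k))]

# Made-up throughput samples in Mbps, standing in for one country's rows.
samples = sorted([12.0, 48.5, 51.2, 95.3, 250.7])
summary = {f"p{p}": percentile(samples, p) for p in (1, 25, 50, 75, 99)}
print(summary)
```

Over millions of rows an exact sort is impractical, which is why the server-side queries can use BigQuery's approximate quantile machinery instead of anything like the toy function above.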
### Running the Data Generation Pipeline

**Prerequisites**:

- Google Cloud SDK (`gcloud`) installed
- BigQuery CLI (`bq`) installed
- Authenticated via `gcloud` with an account subscribed to the
  [M-Lab Discuss mailing list](https://groups.google.com/a/measurementlab.net/g/discuss)
- Python 3.11+

**Complete Pipeline** (recommended):

```bash
cd data/
python3 generate_data.py
```

This orchestrates the complete pipeline:

1. Queries BigQuery for download metrics (throughput, latency, packet loss)
2. Queries BigQuery for upload metrics (throughput)
3. Merges the data into per-country JSON files

Generated files: `us_2024_10.json`, `de_2024_10.json`, `br_2024_10.json`.

**Individual Pipeline Stages** (for debugging):

```bash
cd data/

# Stage 1a: Query downloads
python3 run_query.py query_downloads.sql -o downloads.json

# Stage 1b: Query uploads
python3 run_query.py query_uploads.sql -o uploads.json

# Stage 2: Merge data
python3 merge_data.py
```

**Pipeline Scripts**:

- [generate_data.py](generate_data.py) - Orchestrates the complete pipeline
- [run_query.py](run_query.py) - Executes a BigQuery query and saves results
- [merge_data.py](merge_data.py) - Merges download and upload data into per-country files
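`merge_data.py` is referenced above but its body is not part of this excerpt. As a rough sketch of what the merge stage does, the snippet below groups per-metric rows by country and nests them under `metrics`; the intermediate row layout and field names here are assumptions, not the script's actual code:

```python
# Hypothetical intermediate rows, shaped to match the documented output;
# the real downloads.json / uploads.json layout may differ.
downloads = [
    {"country_code": "BR", "metric": "download_throughput_mbps", "p50": 51.98},
    {"country_code": "DE", "metric": "download_throughput_mbps", "p50": 45.24},
]
uploads = [
    {"country_code": "BR", "metric": "upload_throughput_mbps", "p50": 30.78},
]

merged: dict[str, dict] = {}
for row in downloads + uploads:
    country = merged.setdefault(row["country_code"], {"metrics": {}})
    # Keep only percentile keys (p1, p5, ..., p99) for the metric payload.
    percentiles = {k: v for k, v in row.items() if k.startswith("p")}
    country["metrics"][row["metric"]] = percentiles

print(sorted(merged))  # one entry per country
```

The actual script would additionally attach the `metadata` block (country name, period, sample counts) and write one `{country}_2024_10.json` file per entry.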
### Modifying Queries

To change the time period or countries, edit the SQL files:

```sql
WHERE
  date BETWEEN "2024-10-01" AND "2024-10-31"        -- Change dates here
  AND client.Geo.CountryCode IN ("US", "DE", "BR")  -- Change countries here
```

Country codes follow the
[ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) standard.
## Notes

- **Static data**: These files contain pre-aggregated percentiles for the Phase 1 prototype. Phase 2 will add dynamic data fetching.
- **Time granularity**: Data is aggregated over the entire month (October 2024). The analyst decides which time window to use when fetching data for IQB calculations.
- **Percentile selection**: The Streamlit UI allows users to select which percentile(s) to use for IQB score calculations.
- **File size**: Each file is ~1.4KB (uncompressed). No compression needed.
## M-Lab NDT Data Schema

M-Lab provides two unified views:

- `measurement-lab.ndt.unified_downloads` - Download tests
- `measurement-lab.ndt.unified_uploads` - Upload tests

Key fields used:

- `a.MeanThroughputMbps` - Mean throughput in Mbps
- `a.MinRTT` - Minimum round-trip time in milliseconds
- `a.LossRate` - Packet loss rate (0.0-1.0)
- `client.Geo.CountryCode` - ISO country code
- `date` - Measurement date (YYYY-MM-DD)

See the [M-Lab NDT documentation](https://www.measurementlab.net/tests/ndt/#ndt-data-in-bigquery) for details.

## Future Improvements (Phase 2+)

- Dynamic data fetching from BigQuery
- Support for additional datasets (Ookla, Cloudflare)
- Finer time granularity (daily, weekly)
- Sub-national geographic resolution (cities, ASNs)
- Local database integration for caching aggregated data

data/br_2024_10.json

Lines changed: 57 additions & 0 deletions

{
  "metadata": {
    "country_code": "BR",
    "country_name": "Brazil",
    "period": "2024-10",
    "period_description": "October 2024",
    "dataset": "M-Lab NDT",
    "download_samples": 4944407,
    "upload_samples": 3496328
  },
  "metrics": {
    "download_throughput_mbps": {
      "p1": 0.15979623373499155,
      "p5": 0.9501991252036766,
      "p10": 3.101174869710966,
      "p25": 15.0340700432778,
      "p50": 51.9831305263177,
      "p75": 158.38962702858973,
      "p90": 330.3352983503099,
      "p95": 456.0950392154999,
      "p99": 696.5613392781584
    },
    "upload_throughput_mbps": {
      "p1": 0.042563080079753776,
      "p5": 0.07560071683921148,
      "p10": 0.08980854096320207,
      "p25": 5.545812099052701,
      "p50": 30.78175191467136,
      "p75": 88.37694460346944,
      "p90": 181.64033113619195,
      "p95": 255.97876412741525,
      "p99": 394.3416893812533
    },
    "latency_ms": {
      "p1": 1.394,
      "p5": 3.637,
      "p10": 4.958,
      "p25": 9.079,
      "p50": 19.953,
      "p75": 52.065,
      "p90": 184.738,
      "p95": 234.072,
      "p99": 273.0
    },
    "packet_loss": {
      "p1": 0.0,
      "p5": 0.0,
      "p10": 0.0,
      "p25": 1.1042755272820004e-05,
      "p50": 0.004822712745559209,
      "p75": 0.05811090765473097,
      "p90": 0.13649207990035975,
      "p95": 0.1987869577393624,
      "p99": 0.3652163739953438
    }
  }
}

data/de_2024_10.json

Lines changed: 57 additions & 0 deletions

{
  "metadata": {
    "country_code": "DE",
    "country_name": "Germany",
    "period": "2024-10",
    "period_description": "October 2024",
    "dataset": "M-Lab NDT",
    "download_samples": 7419055,
    "upload_samples": 4377008
  },
  "metrics": {
    "download_throughput_mbps": {
      "p1": 0.22367850581560372,
      "p5": 1.262769802856182,
      "p10": 3.4166592054870026,
      "p25": 13.817824595534129,
      "p50": 45.24430302103892,
      "p75": 100.56946051210859,
      "p90": 248.78115747983244,
      "p95": 377.8657642766346,
      "p99": 741.7983223940372
    },
    "upload_throughput_mbps": {
      "p1": 0.04798033204768874,
      "p5": 0.07565187888251705,
      "p10": 0.19852741925194242,
      "p25": 3.5715003423978087,
      "p50": 17.172955392453527,
      "p75": 36.63458526768415,
      "p90": 53.192909502396375,
      "p95": 101.34444079000329,
      "p99": 285.7324202068485
    },
    "latency_ms": {
      "p1": 0.438,
      "p5": 3.433,
      "p10": 6.787,
      "p25": 11.589,
      "p50": 17.712,
      "p75": 26.382,
      "p90": 38.489,
      "p95": 57.061,
      "p99": 305.85
    },
    "packet_loss": {
      "p1": 0.0,
      "p5": 0.0,
      "p10": 0.0,
      "p25": 0.0,
      "p50": 0.00034573047467282084,
      "p75": 0.016581558328885995,
      "p90": 0.07073353719313655,
      "p95": 0.11517449630011735,
      "p99": 0.2521127443846117
    }
  }
}

data/generate_data.py

Lines changed: 80 additions & 0 deletions

#!/usr/bin/env python3
"""
Orchestrate the data generation pipeline for IQB static data.

This script:
1. Runs BigQuery queries for downloads and uploads
2. Merges the results into per-country JSON files
"""

import subprocess
import sys
from pathlib import Path


def run_command(cmd: list[str], description: str) -> None:
    """Run a command and handle errors."""
    print(f"\n{'=' * 60}")
    print(f"{description}")
    print(f"{'=' * 60}")

    result = subprocess.run(cmd, capture_output=False)

    if result.returncode != 0:
        print(f"\n✗ Failed: {description}", file=sys.stderr)
        sys.exit(1)

    print(f"✓ Completed: {description}")


def main():
    # Resolve all paths relative to this script's directory
    data_dir = Path(__file__).parent

    print("IQB Data Generation Pipeline")
    print("=" * 60)

    # Stage 1a: Query downloads
    run_command(
        [
            "python3",
            str(data_dir / "run_query.py"),
            str(data_dir / "query_downloads.sql"),
            "-o",
            str(data_dir / "downloads.json"),
        ],
        "Stage 1a: Querying download metrics (throughput, latency, packet loss)",
    )

    # Stage 1b: Query uploads
    run_command(
        [
            "python3",
            str(data_dir / "run_query.py"),
            str(data_dir / "query_uploads.sql"),
            "-o",
            str(data_dir / "uploads.json"),
        ],
        "Stage 1b: Querying upload metrics (throughput)",
    )

    # Stage 2: Merge data
    run_command(
        ["python3", str(data_dir / "merge_data.py")],
        "Stage 2: Merging download and upload data into per-country files",
    )

    print("\n" + "=" * 60)
    print("✓ Pipeline completed successfully!")
    print("=" * 60)
    print("\nGenerated files:")

    for country in ["us", "de", "br"]:
        file_path = data_dir / f"{country}_2024_10.json"
        if file_path.exists():
            size = file_path.stat().st_size
            print(f" - {file_path.name} ({size:,} bytes)")


if __name__ == "__main__":
    main()
