Merged
3 changes: 3 additions & 0 deletions data/.gitignore
@@ -0,0 +1,3 @@
# Intermediate pipeline outputs
/downloads.json
/uploads.json
184 changes: 182 additions & 2 deletions data/README.md
@@ -1,3 +1,183 @@
# Datasets
# IQB Static Data Files

A set of sample datasets.
This directory contains static measurement data used by
the IQB prototype for Phase 1 development.

## Current Dataset

**Period**: October 2024 (2024-10-01 to 2024-10-31)

**Source**: [M-Lab NDT](https://www.measurementlab.net/tests/ndt/) unified views

**Countries**: United States (US), Germany (DE), Brazil (BR)

### Files

- `us_2024_10.json` - United States, ~31M download samples, ~24M upload samples

- `de_2024_10.json` - Germany, ~7M download samples, ~4M upload samples

- `br_2024_10.json` - Brazil, ~5M download samples, ~3M upload samples

### Data Structure

Each JSON file contains:

```jsonc
{
  "metadata": {
    "country_code": "US",
    "country_name": "United States",
    "period": "2024-10",
    "period_description": "October 2024",
    "dataset": "M-Lab NDT",
    "download_samples": 31443312,
    "upload_samples": 24288961
  },
  "metrics": {
    "download_throughput_mbps": {"p1": 0.38, /* ... */ "p99": 891.82},
    "upload_throughput_mbps": {"p1": 0.06, /* ... */ "p99": 813.73},
    "latency_ms": {"p1": 0.16, /* ... */ "p99": 254.34},
    "packet_loss": {"p1": 0.0, /* ... */ "p99": 0.25}
  }
}
```

**Percentiles included**: p1, p5, p10, p25, p50, p75, p90, p95, p99
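
Reading one of these files takes only a few lines of Python. The sketch below uses an inline `sample` dict that abbreviates `us_2024_10.json` to two percentiles per metric, so it runs without the file present:

```python
import json

# Minimal sketch of looking up a percentile from a per-country file.
# `sample` abbreviates us_2024_10.json (p1/p99 values as documented above);
# a real consumer would json.load() the file instead.
sample = {
    "metadata": {"country_code": "US", "period": "2024-10"},
    "metrics": {
        "download_throughput_mbps": {"p1": 0.38, "p99": 891.82},
        "latency_ms": {"p1": 0.16, "p99": 254.34},
    },
}

def get_percentile(data: dict, metric: str, percentile: str) -> float:
    """Look up a single percentile value, e.g. ('latency_ms', 'p99')."""
    return data["metrics"][metric][percentile]

# To read the real file:
#     with open("us_2024_10.json") as f:
#         sample = json.load(f)
print(get_percentile(sample, "download_throughput_mbps", "p99"))
```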

## How This Data Was Generated

### BigQuery Queries

The data was extracted from M-Lab's public BigQuery tables using two queries:

1. **Downloads** (`query_downloads.sql`): Queries
`measurement-lab.ndt.unified_downloads` for:

- Download throughput (`a.MeanThroughputMbps`)

- Latency (`a.MinRTT`)

- Packet loss (`a.LossRate`)

2. **Uploads** (`query_uploads.sql`): Queries
`measurement-lab.ndt.unified_uploads` for:

- Upload throughput (`a.MeanThroughputMbps`)

### Running the Data Generation Pipeline

**Prerequisites**:

- Google Cloud SDK (`gcloud`) installed

- BigQuery CLI (`bq`) installed

- Authenticated via `gcloud` with an account subscribed to the
  [M-Lab Discuss mailing list](https://groups.google.com/a/measurementlab.net/g/discuss)

- Python 3.11+

**Complete Pipeline** (recommended):

```bash
cd data/
python3 generate_data.py
```

This orchestrates the complete pipeline:

1. Queries BigQuery for download metrics (throughput, latency, packet loss)

2. Queries BigQuery for upload metrics (throughput)

3. Merges the data into per-country JSON files

Generated files: `us_2024_10.json`, `de_2024_10.json`, `br_2024_10.json`.
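
The merge in step 3 can be sketched as follows. This is illustrative only, not the actual `merge_data.py` logic: the shapes assumed for the intermediate `downloads.json` / `uploads.json` records are hypothetical, while the output field names follow the schema documented above.

```python
def merge_country(downloads: dict, uploads: dict) -> dict:
    """Combine one country's download and upload aggregates into a final record.

    The input shapes here are assumptions standing in for whatever
    run_query.py writes to downloads.json / uploads.json.
    """
    return {
        "metadata": {
            "country_code": downloads["country_code"],
            "download_samples": downloads["samples"],
            "upload_samples": uploads["samples"],
        },
        "metrics": {
            "download_throughput_mbps": downloads["throughput"],
            "upload_throughput_mbps": uploads["throughput"],
            "latency_ms": downloads["latency"],
            "packet_loss": downloads["loss"],
        },
    }

# Rounded Brazil values, for illustration:
merged = merge_country(
    {"country_code": "BR", "samples": 4944407,
     "throughput": {"p50": 51.98}, "latency": {"p50": 19.95},
     "loss": {"p50": 0.0048}},
    {"samples": 3496328, "throughput": {"p50": 30.78}},
)
```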

**Individual Pipeline Stages** (for debugging):

```bash
cd data/

# Stage 1a: Query downloads
python3 run_query.py query_downloads.sql -o downloads.json

# Stage 1b: Query uploads
python3 run_query.py query_uploads.sql -o uploads.json

# Stage 2: Merge data
python3 merge_data.py
```

**Pipeline Scripts**:

- [generate_data.py](generate_data.py) - Orchestrates the complete pipeline

- [run_query.py](run_query.py) - Executes a BigQuery query and saves results

- [merge_data.py](merge_data.py) - Merges download and upload data into
per-country files

### Modifying Queries

To change the time period or countries, edit the SQL files:

```sql
WHERE
  date BETWEEN "2024-10-01" AND "2024-10-31"        -- Change dates here
  AND client.Geo.CountryCode IN ("US", "DE", "BR")  -- Change countries here
```

Country codes follow the
[ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) standard.

## Notes

- **Static data**: These files contain pre-aggregated percentiles
for Phase 1 prototype. Phase 2 will add dynamic data fetching.

- **Time granularity**: Data is aggregated over the entire
  month (October 2024). The analyst decides which time window
  to use when fetching data for IQB calculations.

- **Percentile selection**: The Streamlit UI allows users
to select which percentile(s) to use for IQB score calculations.

- **File size**: Each file is ~1.4KB (uncompressed). No
compression needed.

## M-Lab NDT Data Schema

M-Lab provides two unified views:

- `measurement-lab.ndt.unified_downloads` - Download tests

- `measurement-lab.ndt.unified_uploads` - Upload tests

Key fields used:

- `a.MeanThroughputMbps` - Mean throughput in Mbps

- `a.MinRTT` - Minimum round-trip time in milliseconds

- `a.LossRate` - Packet loss rate (0.0-1.0)

- `client.Geo.CountryCode` - ISO country code

- `date` - Measurement date (YYYY-MM-DD)

See [M-Lab NDT documentation](https://www.measurementlab.net/tests/ndt/#ndt-data-in-bigquery)
for details.
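
A query over these fields might be assembled as below. This is a sketch only, not the repo's `query_downloads.sql`: `APPROX_QUANTILES` is a standard BigQuery aggregate, but the exact grouping and output columns here are assumptions.

```python
def build_downloads_query(start: str, end: str, countries: list[str]) -> str:
    """Build a standard-SQL query aggregating NDT download percentiles.

    Sketch only: column aliases and grouping are illustrative, not the
    repo's actual query_downloads.sql.
    """
    country_list = ", ".join(f'"{c}"' for c in countries)
    return f"""
    SELECT
      client.Geo.CountryCode AS country,
      COUNT(*) AS samples,
      APPROX_QUANTILES(a.MeanThroughputMbps, 100) AS throughput_quantiles,
      APPROX_QUANTILES(a.MinRTT, 100) AS latency_quantiles,
      APPROX_QUANTILES(a.LossRate, 100) AS loss_quantiles
    FROM `measurement-lab.ndt.unified_downloads`
    WHERE date BETWEEN "{start}" AND "{end}"
      AND client.Geo.CountryCode IN ({country_list})
    GROUP BY country
    """

sql = build_downloads_query("2024-10-01", "2024-10-31", ["US", "DE", "BR"])
```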

## Future Improvements (Phase 2+)

- Dynamic data fetching from BigQuery

- Support for additional datasets (Ookla, Cloudflare)

- Finer time granularity (daily, weekly)

- Sub-national geographic resolution (cities, ASNs)

- Local database integration for caching aggregated data
57 changes: 57 additions & 0 deletions data/br_2024_10.json
@@ -0,0 +1,57 @@
{
  "metadata": {
    "country_code": "BR",
    "country_name": "Brazil",
    "period": "2024-10",
    "period_description": "October 2024",
    "dataset": "M-Lab NDT",
    "download_samples": 4944407,
    "upload_samples": 3496328
  },
  "metrics": {
    "download_throughput_mbps": {
      "p1": 0.15979623373499155,
      "p5": 0.9501991252036766,
      "p10": 3.101174869710966,
      "p25": 15.0340700432778,
      "p50": 51.9831305263177,
      "p75": 158.38962702858973,
      "p90": 330.3352983503099,
      "p95": 456.0950392154999,
      "p99": 696.5613392781584
    },
    "upload_throughput_mbps": {
      "p1": 0.042563080079753776,
      "p5": 0.07560071683921148,
      "p10": 0.08980854096320207,
      "p25": 5.545812099052701,
      "p50": 30.78175191467136,
      "p75": 88.37694460346944,
      "p90": 181.64033113619195,
      "p95": 255.97876412741525,
      "p99": 394.3416893812533
    },
    "latency_ms": {
      "p1": 1.394,
      "p5": 3.637,
      "p10": 4.958,
      "p25": 9.079,
      "p50": 19.953,
      "p75": 52.065,
      "p90": 184.738,
      "p95": 234.072,
      "p99": 273.0
    },
    "packet_loss": {
      "p1": 0.0,
      "p5": 0.0,
      "p10": 0.0,
      "p25": 1.1042755272820004e-05,
      "p50": 0.004822712745559209,
      "p75": 0.05811090765473097,
      "p90": 0.13649207990035975,
      "p95": 0.1987869577393624,
      "p99": 0.3652163739953438
    }
  }
}
57 changes: 57 additions & 0 deletions data/de_2024_10.json
@@ -0,0 +1,57 @@
{
  "metadata": {
    "country_code": "DE",
    "country_name": "Germany",
    "period": "2024-10",
    "period_description": "October 2024",
    "dataset": "M-Lab NDT",
    "download_samples": 7419055,
    "upload_samples": 4377008
  },
  "metrics": {
    "download_throughput_mbps": {
      "p1": 0.22367850581560372,
      "p5": 1.262769802856182,
      "p10": 3.4166592054870026,
      "p25": 13.817824595534129,
      "p50": 45.24430302103892,
      "p75": 100.56946051210859,
      "p90": 248.78115747983244,
      "p95": 377.8657642766346,
      "p99": 741.7983223940372
    },
    "upload_throughput_mbps": {
      "p1": 0.04798033204768874,
      "p5": 0.07565187888251705,
      "p10": 0.19852741925194242,
      "p25": 3.5715003423978087,
      "p50": 17.172955392453527,
      "p75": 36.63458526768415,
      "p90": 53.192909502396375,
      "p95": 101.34444079000329,
      "p99": 285.7324202068485
    },
    "latency_ms": {
      "p1": 0.438,
      "p5": 3.433,
      "p10": 6.787,
      "p25": 11.589,
      "p50": 17.712,
      "p75": 26.382,
      "p90": 38.489,
      "p95": 57.061,
      "p99": 305.85
    },
    "packet_loss": {
      "p1": 0.0,
      "p5": 0.0,
      "p10": 0.0,
      "p25": 0.0,
      "p50": 0.00034573047467282084,
      "p75": 0.016581558328885995,
      "p90": 0.07073353719313655,
      "p95": 0.11517449630011735,
      "p99": 0.2521127443846117
    }
  }
}
80 changes: 80 additions & 0 deletions data/generate_data.py
@@ -0,0 +1,80 @@
#!/usr/bin/env python3
"""
Orchestrate the data generation pipeline for IQB static data.

This script:
1. Runs BigQuery queries for downloads and uploads
2. Merges the results into per-country JSON files
"""

import subprocess
import sys
from pathlib import Path


def run_command(cmd: list[str], description: str) -> None:
    """Run a command and handle errors."""
    print(f"\n{'=' * 60}")
    print(f"{description}")
    print(f"{'=' * 60}")

    result = subprocess.run(cmd, capture_output=False)

    if result.returncode != 0:
        print(f"\n✗ Failed: {description}", file=sys.stderr)
        sys.exit(1)

    print(f"✓ Completed: {description}")


def main():
    # Resolve all paths relative to this script's directory, so the
    # pipeline works regardless of the caller's working directory
    data_dir = Path(__file__).parent

    print("IQB Data Generation Pipeline")
    print("=" * 60)

    # Stage 1a: Query downloads
    run_command(
        [
            "python3",
            str(data_dir / "run_query.py"),
            str(data_dir / "query_downloads.sql"),
            "-o",
            str(data_dir / "downloads.json"),
        ],
        "Stage 1a: Querying download metrics (throughput, latency, packet loss)",
    )

    # Stage 1b: Query uploads
    run_command(
        [
            "python3",
            str(data_dir / "run_query.py"),
            str(data_dir / "query_uploads.sql"),
            "-o",
            str(data_dir / "uploads.json"),
        ],
        "Stage 1b: Querying upload metrics (throughput)",
    )

    # Stage 2: Merge data
    run_command(
        ["python3", str(data_dir / "merge_data.py")],
        "Stage 2: Merging download and upload data into per-country files",
    )

    print("\n" + "=" * 60)
    print("✓ Pipeline completed successfully!")
    print("=" * 60)
    print("\nGenerated files:")

    for country in ["us", "de", "br"]:
        file_path = data_dir / f"{country}_2024_10.json"
        if file_path.exists():
            size = file_path.stat().st_size
            print(f"  - {file_path.name} ({size:,} bytes)")


if __name__ == "__main__":
    main()