Commit ed1d11f

refactor(data): move SQL queries to library package (#29)
Move BigQuery query templates from `./data/` to `./library/src/iqb/queries/`
to make them officially part of the library.

Benefits:

1. Queries are now package resources (can be imported)
2. Cleaner separation: queries belong with library code
3. Prepares for future pipeline refactoring

No functional changes - same queries, same results confirmed.

While there, update the documentation of the `./data` directory to sync it
up with reality.
1 parent a0cf0b9 commit ed1d11f
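
As a rough illustration of what "package resources" means in the commit message, the sketch below loads a `.sql` file through `importlib.resources.files`. It does not use the real `iqb.queries` package: `demo_queries`, the SQL text, and the `load_query` helper are hypothetical stand-ins created in a temporary directory so the example runs on its own.

```python
import importlib
import sys
import tempfile
from importlib.resources import files
from pathlib import Path

# Build a throwaway package on disk so the example runs without the real
# library. "demo_queries" and the SQL text are hypothetical stand-ins.
tmp_root = Path(tempfile.mkdtemp())
pkg_dir = tmp_root / "demo_queries"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text('"""Demo SQL query templates."""\n')
(pkg_dir / "downloads_by_country.sql").write_text(
    "SELECT 1 WHERE date >= '{START_DATE}' AND date < '{END_DATE}'\n"
)

# Make the temporary package importable.
sys.path.insert(0, str(tmp_root))
importlib.invalidate_caches()
demo_queries = importlib.import_module("demo_queries")


def load_query(package, query_name: str) -> str:
    """Return the text of <query_name>.sql stored inside the given package."""
    resource = files(package).joinpath(f"{query_name}.sql")
    return resource.read_text(encoding="utf-8")


template = load_query(demo_queries, "downloads_by_country")
print(template)
```

Because the lookup goes through the package rather than a filesystem path, callers only need the query name, which is the shape of the change made to `run_query.py` in this commit.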

7 files changed: 170 additions, 56 deletions
data/README.md

Lines changed: 9 additions & 42 deletions
@@ -19,21 +19,14 @@ the IQB prototype for Phase 1 development.
 
 - `br_2024_10.json` - Brazil, ~5M download samples, ~3M upload samples
 
+- ... (more files)
+
 ### Data Structure
 
 Each JSON file contains:
 
 ```JavaScript
 {
-  "metadata": {
-    "country_code": "US",
-    "country_name": "United States",
-    "period": "2024-10",
-    "period_description": "October 2024",
-    "dataset": "M-Lab NDT",
-    "download_samples": 31443312,
-    "upload_samples": 24288961
-  },
   "metrics": {
     "download_throughput_mbps": {"p1": 0.38, /* ... */, "p99": 891.82},
     "upload_throughput_mbps": {"p1": 0.06, /* ... */, "p99": 813.73},
@@ -49,21 +42,8 @@ Each JSON file contains:
 
 ### BigQuery Queries
 
-The data was extracted from M-Lab's public BigQuery tables using two queries:
-
-1. **Downloads** (`query_downloads.sql`): Queries
-   `measurement-lab.ndt.unified_downloads` for:
-
-   - Download throughput (`a.MeanThroughputMbps`)
-
-   - Latency (`a.MinRTT`)
-
-   - Packet loss (`a.LossRate`)
-
-2. **Uploads** (`query_uploads.sql`): Queries
-   `measurement-lab.ndt.unified_uploads` for:
-
-   - Upload throughput (`a.MeanThroughputMbps`)
+The data was extracted from M-Lab's public BigQuery tables using queries
+inside the [../library/src/iqb/queries](../library/src/iqb/queries) package.
 
 ### Running the Data Generation Pipeline
 
@@ -76,13 +56,13 @@ The data was extracted from M-Lab's public BigQuery tables using two queries:
 - `gcloud`-authenticated with an account subscribed to
   [M-Lab Discuss mailing list](https://groups.google.com/a/measurementlab.net/g/discuss)
 
-- Python 3.11+
+- Python 3.13 using `uv` as documented in the toplevel [README.md](../README.md)
 
 **Complete Pipeline** (recommended):
 
 ```bash
 cd data/
-python3 generate_data.py
+uv run python generate_data.py
 ```
 
 This orchestrates the complete pipeline:
@@ -101,13 +81,13 @@ Generated files: `us_2024_10.json`, `de_2024_10.json`, `br_2024_10.json`.
 cd data/
 
 # Stage 1a: Query downloads
-python3 run_query.py query_downloads.sql -o downloads.json
+uv run python run_query.py query_downloads.sql -o downloads.json
 
 # Stage 1b: Query uploads
-python3 run_query.py query_uploads.sql -o uploads.json
+uv run python run_query.py query_uploads.sql -o uploads.json
 
 # Stage 2: Merge data
-python3 merge_data.py
+uv run python merge_data.py
 ```
 
 **Pipeline Scripts**:
@@ -119,19 +99,6 @@ python3 merge_data.py
 - [merge_data.py](merge_data.py) - Merges download and upload data into
   per-country files
 
-### Modifying Queries
-
-To change the time period or countries, edit the SQL files:
-
-```sql
-WHERE
-  date BETWEEN "2024-10-01" AND "2024-10-31" -- Change dates here
-  AND client.Geo.CountryCode IN ("US", "DE", "BR") -- Change countries here
-```
-
-Country codes follow the
-[ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) standard.
-
 ## Notes
 
 - **Static data**: These files contain pre-aggregated percentiles

data/generate_data.py

Lines changed: 2 additions & 2 deletions
@@ -49,7 +49,7 @@ def generate_for_period(
         [
             "python3",
             str(data_dir / "run_query.py"),
-            str(data_dir / "query_downloads_template.sql"),
+            "downloads_by_country",
             "--start-date",
             start_date,
             "--end-date",
@@ -65,7 +65,7 @@ def generate_for_period(
         [
             "python3",
             str(data_dir / "run_query.py"),
-            str(data_dir / "query_uploads_template.sql"),
+            "uploads_by_country",
             "--start-date",
             start_date,
             "--end-date",
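
The two hunks above build near-identical argument lists that differ only in the query name. A minimal sketch of factoring that out follows. The helper name `build_query_command` is hypothetical, it covers only the arguments visible in the hunks (the real calls may pass more), and the use of `sys.executable` in place of the hard-coded `"python3"` is an assumption, not something this commit does.

```python
import sys
from pathlib import Path


def build_query_command(
    data_dir: Path, query_name: str, start_date: str, end_date: str
) -> list[str]:
    """Build the argv list for invoking run_query.py with a named query."""
    return [
        # Assumption: reuse the running interpreter so the subprocess sees
        # the same environment (e.g. the one created by `uv run`).
        sys.executable,
        str(data_dir / "run_query.py"),
        query_name,
        "--start-date",
        start_date,
        "--end-date",
        end_date,
    ]


cmd = build_query_command(
    Path("data"), "downloads_by_country", "2024-10-01", "2024-10-31"
)
print(cmd)
```

The resulting list could be passed to `subprocess.run` in the same way as the literal lists in `generate_for_period`.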

data/run_query.py

Lines changed: 16 additions & 12 deletions
@@ -5,8 +5,13 @@
 import subprocess
 import sys
 from datetime import datetime
+from importlib.resources import files
 from pathlib import Path
 
+# Add library to path so we can import iqb.queries
+sys.path.insert(0, str(Path(__file__).parent.parent / "library" / "src"))
+import iqb.queries
+
 
 def validate_date(date_str: str) -> str:
     """
@@ -31,7 +36,7 @@ def validate_date(date_str: str) -> str:
 
 
 def run_bq_query(
-    query_file: Path,
+    query_name: str,
     output_file: Path | None,
     project_id: str,
     start_date: str,
@@ -55,7 +60,7 @@ def run_bq_query(
         Template becomes: date >= '2024-10-01' AND date < '2024-11-01'
 
     Args:
-        query_file: Path to SQL query template file
+        query_name: Name of SQL query template (e.g., "downloads_by_country")
         output_file: Path where to save JSON output (None = stdout)
         project_id: GCP project ID for billing
         start_date: Start date in YYYY-MM-DD format (inclusive)
@@ -70,12 +75,12 @@ def run_bq_query(
             f"start_date must be <= end_date, got: {start_date} > {end_date}"
         )
 
-    print(f"Running query: {query_file}", file=sys.stderr)
+    print(f"Running query: {query_name}", file=sys.stderr)
     print(f" Date range: {start_date} to {end_date}", file=sys.stderr)
 
-    # Read query template
-    with open(query_file) as f:
-        query = f.read()
+    # Load query template from iqb.queries package
+    query_file = files(iqb.queries).joinpath(f"{query_name}.sql")
+    query = query_file.read_text()
 
     # Substitute template variables
     query = query.replace("{START_DATE}", start_date)
@@ -125,7 +130,10 @@ def main():
     parser = argparse.ArgumentParser(
         description="Execute BigQuery query template and save results"
     )
-    parser.add_argument("query_file", type=Path, help="Path to SQL query template file")
+    parser.add_argument(
+        "query_name",
+        help="Name of SQL query template (e.g., 'downloads_by_country', 'uploads_by_country')",
+    )
     parser.add_argument(
         "-o", "--output", type=Path, help="Path to output JSON file (default: stdout)"
     )
@@ -147,12 +155,8 @@ def main():
 
     args = parser.parse_args()
 
-    if not args.query_file.exists():
-        print(f"Error: Query file not found: {args.query_file}", file=sys.stderr)
-        sys.exit(1)
-
     run_bq_query(
-        args.query_file,
+        args.query_name,
         args.output,
         args.project_id,
         args.start_date,
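
With the file-existence check removed, the query name now flows straight into a resource lookup followed by placeholder substitution. The sketch below shows one way the name and dates might be validated before substitution. The helper names (`check_query_name`, `render_query`) and the identifier-only naming rule are assumptions for illustration and are not part of this commit; only the `{START_DATE}` and `{END_DATE}` placeholders come from the diff.

```python
import re
from datetime import datetime

# Hypothetical rule: query names are plain lowercase identifiers, so a name
# such as "../secrets" can never point outside the package directory.
_QUERY_NAME = re.compile(r"^[a-z][a-z0-9_]*$")


def check_query_name(query_name: str) -> str:
    """Return query_name unchanged, or raise ValueError if it looks unsafe."""
    if not _QUERY_NAME.match(query_name):
        raise ValueError(f"invalid query name: {query_name!r}")
    return query_name


def render_query(template: str, start_date: str, end_date: str) -> str:
    """Substitute {START_DATE} and {END_DATE} after validating both dates."""
    # strptime raises ValueError when a date is not in YYYY-MM-DD form.
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    if start > end:
        raise ValueError(
            f"start_date must be <= end_date, got: {start_date} > {end_date}"
        )
    query = template.replace("{START_DATE}", start_date)
    return query.replace("{END_DATE}", end_date)


rendered = render_query(
    "date >= '{START_DATE}' AND date < '{END_DATE}'", "2024-10-01", "2024-11-01"
)
print(rendered)
```

Plain string replacement mirrors what `run_bq_query` does in the diff. Whether stricter validation is wanted depends on who is allowed to choose query names and dates.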
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+"""SQL query templates for IQB data collection.
+
+This module contains SQL query templates for fetching measurement data
+from BigQuery NDT tables.
+"""
File renamed without changes.
File renamed without changes.

library/tests/iqb/queries_test.py

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
+"""Tests for the iqb.queries SQL query templates module."""
+
+from importlib.resources import files
+
+import iqb.queries
+
+
+class TestQueriesPackage:
+    """Tests for iqb.queries package structure."""
+
+    def test_queries_package_can_be_imported(self):
+        """Test that iqb.queries package can be imported."""
+        assert iqb.queries is not None
+
+    def test_queries_package_has_files(self):
+        """Test that queries package provides access to files."""
+        query_files = files(iqb.queries)
+        assert query_files is not None
+
+
+class TestDownloadsByCountryQuery:
+    """Tests for downloads_by_country.sql query template."""
+
+    def test_downloads_by_country_exists(self):
+        """Test that downloads_by_country.sql query file exists."""
+        query_file = files(iqb.queries).joinpath("downloads_by_country.sql")
+        assert query_file.is_file()
+
+    def test_downloads_by_country_can_be_read(self):
+        """Test that downloads_by_country.sql can be read."""
+        query_file = files(iqb.queries).joinpath("downloads_by_country.sql")
+        content = query_file.read_text()
+        assert content is not None
+        assert len(content) > 0
+
+    def test_downloads_by_country_has_date_placeholders(self):
+        """Test that downloads_by_country.sql contains date placeholders."""
+        query_file = files(iqb.queries).joinpath("downloads_by_country.sql")
+        content = query_file.read_text()
+        assert "{START_DATE}" in content
+        assert "{END_DATE}" in content
+
+    def test_downloads_by_country_queries_unified_downloads_table(self):
+        """Test that downloads_by_country.sql queries the correct table."""
+        query_file = files(iqb.queries).joinpath("downloads_by_country.sql")
+        content = query_file.read_text()
+        assert "measurement-lab.ndt.unified_downloads" in content
+
+    def test_downloads_by_country_groups_by_country_code(self):
+        """Test that downloads_by_country.sql groups by country code."""
+        query_file = files(iqb.queries).joinpath("downloads_by_country.sql")
+        content = query_file.read_text()
+        assert "GROUP BY country_code" in content
+
+    def test_downloads_by_country_calculates_percentiles(self):
+        """Test that downloads_by_country.sql calculates percentiles."""
+        query_file = files(iqb.queries).joinpath("downloads_by_country.sql")
+        content = query_file.read_text()
+        # Should calculate p1, p5, p10, p25, p50, p75, p90, p95, p99
+        assert "APPROX_QUANTILES" in content
+        assert "download_p95" in content or "download_p99" in content
+
+
+class TestUploadsByCountryQuery:
+    """Tests for uploads_by_country.sql query template."""
+
+    def test_uploads_by_country_exists(self):
+        """Test that uploads_by_country.sql query file exists."""
+        query_file = files(iqb.queries).joinpath("uploads_by_country.sql")
+        assert query_file.is_file()
+
+    def test_uploads_by_country_can_be_read(self):
+        """Test that uploads_by_country.sql can be read."""
+        query_file = files(iqb.queries).joinpath("uploads_by_country.sql")
+        content = query_file.read_text()
+        assert content is not None
+        assert len(content) > 0
+
+    def test_uploads_by_country_has_date_placeholders(self):
+        """Test that uploads_by_country.sql contains date placeholders."""
+        query_file = files(iqb.queries).joinpath("uploads_by_country.sql")
+        content = query_file.read_text()
+        assert "{START_DATE}" in content
+        assert "{END_DATE}" in content
+
+    def test_uploads_by_country_queries_unified_uploads_table(self):
+        """Test that uploads_by_country.sql queries the correct table."""
+        query_file = files(iqb.queries).joinpath("uploads_by_country.sql")
+        content = query_file.read_text()
+        assert "measurement-lab.ndt.unified_uploads" in content
+
+    def test_uploads_by_country_groups_by_country_code(self):
+        """Test that uploads_by_country.sql groups by country code."""
+        query_file = files(iqb.queries).joinpath("uploads_by_country.sql")
+        content = query_file.read_text()
+        assert "GROUP BY country_code" in content
+
+    def test_uploads_by_country_calculates_percentiles(self):
+        """Test that uploads_by_country.sql calculates percentiles."""
+        query_file = files(iqb.queries).joinpath("uploads_by_country.sql")
+        content = query_file.read_text()
+        # Should calculate p1, p5, p10, p25, p50, p75, p90, p95, p99
+        assert "APPROX_QUANTILES" in content
+        assert "upload_p95" in content or "upload_p99" in content
+
+
+class TestQueryTemplateSubstitution:
+    """Tests for query template placeholder substitution."""
+
+    def test_downloads_query_date_substitution(self):
+        """Test that date placeholders can be substituted in downloads query."""
+        query_file = files(iqb.queries).joinpath("downloads_by_country.sql")
+        template = query_file.read_text()
+
+        # Substitute placeholders
+        query = template.replace("{START_DATE}", "2024-10-01")
+        query = query.replace("{END_DATE}", "2024-11-01")
+
+        # Verify substitution worked
+        assert "{START_DATE}" not in query
+        assert "{END_DATE}" not in query
+        assert "2024-10-01" in query
+        assert "2024-11-01" in query
+
+    def test_uploads_query_date_substitution(self):
+        """Test that date placeholders can be substituted in uploads query."""
+        query_file = files(iqb.queries).joinpath("uploads_by_country.sql")
+        template = query_file.read_text()
+
+        # Substitute placeholders
+        query = template.replace("{START_DATE}", "2024-10-01")
+        query = query.replace("{END_DATE}", "2024-11-01")
+
+        # Verify substitution worked
+        assert "{START_DATE}" not in query
+        assert "{END_DATE}" not in query
+        assert "2024-10-01" in query
+        assert "2024-11-01" in query
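
The download and upload test classes above check the same set of properties against two templates. One possible way to express those checks as data is sketched below. The `REQUIRED_FRAGMENTS` table and the `missing_fragments` helper are hypothetical, and the sample template is invented so the example runs without the real `iqb.queries` package.

```python
# Fragments each template is expected to contain, mirroring the assertions in
# the tests above. The table layout and helper name are illustrative only.
REQUIRED_FRAGMENTS = {
    "downloads_by_country": [
        "{START_DATE}",
        "{END_DATE}",
        "measurement-lab.ndt.unified_downloads",
        "GROUP BY country_code",
        "APPROX_QUANTILES",
    ],
    "uploads_by_country": [
        "{START_DATE}",
        "{END_DATE}",
        "measurement-lab.ndt.unified_uploads",
        "GROUP BY country_code",
        "APPROX_QUANTILES",
    ],
}


def missing_fragments(query_name: str, template: str) -> list[str]:
    """Return the required fragments that do not appear in the template."""
    return [f for f in REQUIRED_FRAGMENTS[query_name] if f not in template]


# Invented sample text standing in for the real uploads template.
sample = (
    "SELECT APPROX_QUANTILES(x, 100) "
    "FROM `measurement-lab.ndt.unified_uploads` "
    "WHERE date >= '{START_DATE}' AND date < '{END_DATE}' "
    "GROUP BY country_code"
)
print(missing_fragments("uploads_by_country", sample))
```

An empty list means the template satisfies every listed check, so a single loop over `REQUIRED_FRAGMENTS` could cover both queries.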
