
Commit 689fdba

feat(library): add ghremote package (#88)
The `ghremote` package implements `pipeline.RemoteCache` (introduced in #85) using GitHub releases to publish files. The implementation is at `ghremote.IQBGitHubRemoteCache` and derives from pre-existing code living inside the `./data/ghcache.py` file. We also add `dacite` as a dependency, to easily convert parsed JSON into dataclasses.

Using GitHub releases to publish datasets is meant as an interim solution: we are also going to implement a better approach for datasets, possibly based on GCS buckets. To this end, we need to refactor the existing code to create the facilities that make a GCS-based approach possible. Hence, this diff.

This diff adds the basic functionality and tests, along with minor changes and tweaks in the rest of the codebase (boy scout rule). We integrated the code added by this PR and verified it works as intended in #87.
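For context on the `dacite` choice: converting parsed manifest JSON into nested dataclasses by hand looks roughly like the stdlib-only sketch below. The dataclass names mirror those added in `ghremote/cache.py`; the `manifest_from_dict` helper and the sample data are illustrative. `dacite.from_dict(Manifest, data)` replaces this manual plumbing.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileEntry:
    sha256: str
    url: str


@dataclass(frozen=True)
class Manifest:
    v: int
    files: dict[str, FileEntry] = field(default_factory=dict)


def manifest_from_dict(data: dict) -> Manifest:
    # Manual equivalent of dacite.from_dict(Manifest, data): build the
    # nested FileEntry dataclasses first, then the top-level Manifest.
    files = {key: FileEntry(**value) for key, value in data.get("files", {}).items()}
    return Manifest(v=data["v"], files=files)


data = {
    "v": 0,
    "files": {
        "cache/v1/data.parquet": {"sha256": "3a421c62179a", "url": "https://example.com/f"},
    },
}
manifest = manifest_from_dict(data)
print(manifest.files["cache/v1/data.parquet"].sha256)  # 3a421c62179a
```

As nesting deepens (dicts of dataclasses, optional fields), this boilerplate grows, which is what motivates the dependency.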
1 parent b44aeea commit 689fdba

File tree: 11 files changed (+635 −11 lines)

library/README.md

Lines changed: 19 additions & 8 deletions
@@ -40,7 +40,8 @@ iqb.print_config()
 
 ## Running Tests
 
-The library uses pytest for testing. Tests are located in the `tests/` directory and follow the `*_test.py` naming convention.
+The library uses `pytest` for testing. Tests are located in the `tests/`
+directory and follow the `*_test.py` naming convention.
 
 ```bash
 # From the repository root, sync dev dependencies
@@ -59,15 +60,20 @@ uv run pytest tests/iqb_score_test.py
 # Run specific test class or function
 uv run pytest tests/iqb_score_test.py::TestIQBInitialization
 uv run pytest tests/iqb_score_test.py::TestIQBInitialization::test_init_with_name
+
+# Get coverage
+uv run pytest --cov=.
 ```
 
 ## Code Quality Tools
 
-The library uses Ruff for linting/formatting and Pyright for type checking.
+The library uses `ruff` for linting/formatting and
+`pyright` for type checking.
 
 ### Linting with Ruff
 
-Ruff is a fast Python linter that checks code style, potential bugs, and code quality issues:
+Ruff is a fast Python linter that checks code style, potential
+bugs, and code quality issues:
 
 ```bash
 # From the library directory
@@ -105,7 +111,8 @@ uv run pyright --verbose
 
 **Verifying Pyright is Working:**
 
-Pyright can be silent if misconfigured. To verify it's actually checking your code:
+Pyright can be silent if misconfigured. To verify it's actually
+checking your code:
 
 ```bash
 # This should show which files are being analyzed
@@ -118,7 +125,7 @@ uv run pyright --verbose
 # - "X errors, Y warnings, Z informations"
 ```
 
-If you see "Found 0 source files", the configuration is wrong.
+If you see `"Found 0 source files"`, the configuration is wrong.
 
 To test that Pyright catches errors, temporarily introduce a type error:
 
@@ -127,15 +134,17 @@ To test that Pyright catches errors, temporarily introduce a type error:
 x: int = "this should fail" # Pyright should catch this!
 ```
 
-If Pyright reports an error, it's working correctly. Remove the test line afterwards.
+If Pyright reports an error, it's working correctly. Remove the
+test line afterwards.
 
 Pyright configuration is in `pyproject.toml` under `[tool.pyright]`.
 
 ## Development
 
 ### Adding New Tests
 
-Create new test files in the `tests/` directory following the naming pattern `*_test.py`:
+Create new test files in the `tests/` directory following the
+naming pattern `*_test.py`:
 
 ```python
 """tests/my_feature_test.py"""
@@ -151,4 +160,6 @@ class TestMyFeature:
 
 ### Running Tests in CI
 
-Tests run automatically on all pushes and pull requests to the main branch via GitHub Actions. See `.github/workflows/ci.yml` for the CI configuration.
+Tests run automatically on all pushes and pull requests to the
+`main` branch via GitHub Actions. See `.github/workflows/ci.yml`
+for the CI configuration.

library/pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ dependencies = [
     "pandas>=2.0.0",
     "db-dtypes>=1.0.0",
     "python-dateutil>=2.9.0.post0",
+    "dacite>=1.9.2",
 ]
 
 [project.urls]
[project.urls]
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+"""
+GitHub remote cache synchronization tool for IQB data files.
+
+This is a throwaway module for the initial phase of the project. It will
+eventually be replaced by a proper GCS-based solution.
+
+Manifest format:
+
+    {
+        "v": 0,
+        "files": {
+            "cache/v1/.../data.parquet": {
+                "sha256": "3a421c62179a...",
+                "url": "https://github.com/.../3a421c62179a__cache__v1__...parquet"
+            }
+        }
+    }
+"""
+
+from .cache import (
+    IQBGitHubRemoteCache,
+    iqb_github_load_manifest,
+)
+
+__all__ = [
+    "IQBGitHubRemoteCache",
+    "iqb_github_load_manifest",
+]
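The manifest keys above are data-dir-relative paths written with forward slashes on every OS. A quick sketch of how such a key is derived with `pathlib` (the concrete paths are illustrative, not from the codebase):

```python
from pathlib import PurePath

# Illustrative paths; real ones come from the pipeline cache entry.
data_dir = PurePath("/home/user/iqb/data")
full_path = data_dir / "cache" / "v1" / "data.parquet"

# relative_to() strips the data directory prefix and as_posix()
# guarantees forward slashes, yielding a stable manifest key.
key = full_path.relative_to(data_dir).as_posix()
print(key)  # cache/v1/data.parquet
```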

library/src/iqb/ghremote/cache.py

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
+"""Module containing the RemoteCache implementation."""
+
+from __future__ import annotations
+
+import hashlib
+import json
+import logging
+import os
+from dataclasses import dataclass, field
+from pathlib import Path
+from urllib.request import urlopen
+
+from dacite import from_dict
+
+from ..pipeline.cache import PipelineCacheEntry
+
+
+@dataclass(frozen=True)
+class FileEntry:
+    """Entry in the manifest for a single cached file."""
+
+    sha256: str
+    url: str
+
+
+@dataclass(frozen=True)
+class Manifest:
+    """Manifest for cached files stored in GitHub releases."""
+
+    v: int
+    files: dict[str, FileEntry] = field(default_factory=dict)
+
+    def __post_init__(self):
+        if self.v != 0:
+            raise ValueError(f"Unsupported manifest version: {self.v} (only v=0 supported)")
+
+    def get_file_entry(self, *, full_path: Path, data_dir: Path) -> FileEntry:
+        """
+        Return the file entry corresponding to a given full path and data directory.
+
+        Raises:
+            KeyError: if the given remote entry does not exist.
+        """
+        # Use .as_posix() to ensure forward slashes for cross-platform compatibility:
+        # manifest keys should always use forward slashes regardless of OS.
+        key = full_path.relative_to(data_dir).as_posix()
+        try:
+            return self.files[key]
+        except KeyError as exc:
+            raise KeyError(f"no remotely-cached file for {key}") from exc
+
+
+def iqb_github_load_manifest(manifest_file: Path) -> Manifest:
+    """Load the manifest from the given file, or return an empty manifest if not found."""
+    if not manifest_file.exists():
+        return Manifest(v=0, files={})
+
+    with open(manifest_file) as filep:
+        data = json.load(filep)
+
+    return from_dict(Manifest, data)
+
+
+class IQBGitHubRemoteCache:
+    """
+    Remote cache for query results using GitHub releases.
+
+    This class implements the pipeline.RemoteCache protocol.
+    """
+
+    def __init__(self, manifest: Manifest) -> None:
+        self.manifest = manifest
+
+    def sync(self, entry: PipelineCacheEntry) -> bool:
+        """
+        Sync the remote cache entry to disk and return whether
+        we successfully synced it. Emits logging messages
+        explaining what it is doing and warning about issues
+        that occurred while trying to sync from the remote.
+        """
+        try:
+            logging.info(f"ghremote: syncing {entry}... start")
+            self._sync(entry)
+            logging.info(f"ghremote: syncing {entry}... ok")
+            return True
+        except Exception as exc:
+            logging.warning(f"ghremote: syncing {entry}... failure: {exc}")
+            return False
+
+    def _sync(self, entry: PipelineCacheEntry):
+        # Look up files in the manifest using pipeline-provided paths
+        # so we don't need to revalidate them again.
+        parquet_entry = self.manifest.get_file_entry(
+            full_path=entry.data_parquet_file_path(),
+            data_dir=entry.data_dir,
+        )
+        json_entry = self.manifest.get_file_entry(
+            full_path=entry.stats_json_file_path(),
+            data_dir=entry.data_dir,
+        )
+
+        # Sync both entries, giving preference to the JSON since it's smaller
+        # and leads to less wasted bandwidth if the parquet doesn't exist.
+        _sync_file_entry(json_entry, entry.stats_json_file_path())
+        _sync_file_entry(parquet_entry, entry.data_parquet_file_path())
+
+
+def _sync_file_entry(entry: FileEntry, dest_path: Path):
+    """Sync the given FileEntry with the file cached in a GitHub release."""
+    # Determine whether we need to download again
+    exists = dest_path.exists()
+    if not exists or entry.sha256 != _compute_sha256(dest_path):
+        # If an old file exists, remove it
+        if exists:
+            os.unlink(dest_path)
+
+        # Download into the destination file directly
+        logging.info(f"ghremote: fetching {entry}... start")
+        dest_path.parent.mkdir(parents=True, exist_ok=True)
+        with urlopen(entry.url) as response, open(dest_path, "wb") as fp:
+            while chunk := response.read(8192):
+                fp.write(chunk)
+        logging.info(f"ghremote: fetching {entry}... ok")
+
+        # Make sure the sha256 matches
+        logging.info(f"ghremote: validating {entry}... start")
+        sha256 = _compute_sha256(dest_path)
+        if sha256 != entry.sha256:
+            os.unlink(dest_path)
+            raise ValueError(f"SHA256 mismatch: expected {entry.sha256}, got {sha256}")
+        logging.info(f"ghremote: validating {entry}... ok")
+
+
+def _compute_sha256(path: Path) -> str:
+    """Compute the SHA256 hash of a file."""
+    sha256 = hashlib.sha256()
+    with open(path, "rb") as fp:
+        while chunk := fp.read(8192):
+            sha256.update(chunk)
+    return sha256.hexdigest()
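The `_compute_sha256` helper streams the file in 8 KiB chunks so large parquet files never need to fit in memory at once. A self-contained check of the streaming approach against hashlib's one-shot digest (the temp file and `compute_sha256` name are illustrative):

```python
import hashlib
import tempfile
from pathlib import Path


def compute_sha256(path: Path) -> str:
    # Stream in 8 KiB chunks, mirroring _compute_sha256 in ghremote/cache.py
    sha256 = hashlib.sha256()
    with open(path, "rb") as fp:
        while chunk := fp.read(8192):
            sha256.update(chunk)
    return sha256.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.bin"
    payload = b"x" * 100_000  # larger than one chunk, so several reads occur
    path.write_bytes(payload)
    # The streaming digest must match the one-shot digest
    assert compute_sha256(path) == hashlib.sha256(payload).hexdigest()
```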

library/src/iqb/pipeline/bqpq.py

Lines changed: 3 additions & 2 deletions
@@ -60,8 +60,9 @@ def save_data_parquet(self) -> Path:
         parquet_path = self.paths_provider.data_parquet_file_path()
         parquet_path.parent.mkdir(parents=True, exist_ok=True)
 
-        # Note: using .as_posix to avoid paths with backslashes
-        # that can cause issues with PyArrow on Windows
+        # Note: using .as_posix() for consistency with forward slashes across platforms.
+        # Modern PyArrow likely handles native Windows paths fine, but this is defensive
+        # programming that ensures compatibility without harm.
         posix_path = parquet_path.as_posix()
 
         # Access the first batch to obtain the schema
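The `.as_posix()` behavior the updated comment relies on is easy to verify with a stdlib-only check (the example path is illustrative):

```python
from pathlib import PureWindowsPath

# PurePath.as_posix() always renders the path with forward slashes,
# regardless of the flavor's native separator.
p = PureWindowsPath(r"data\cache\v1\data.parquet")
print(p.as_posix())  # data/cache/v1/data.parquet
```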

library/tests/iqb/cache/cache_test.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-"""Tests for the IQBCache data fetching module."""
+"""Tests for the iqb.cache.cache module."""
 
 from datetime import datetime
 

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+"""Tests for the iqb.ghremote package."""
