@oesteban
Member

Summary

  • add a streaming CLI to migrate MongoDB event collections into the partitioned Parquet dataset
  • expose reusable manifest helpers in src/api.py so the CLI matches the fetcher’s normalization
  • document the migration workflow and add the required optional dependencies
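The streaming and idempotent behaviour summarized above can be illustrated with a minimal, standard-library-only sketch. It is not the code in scripts/migrate_mongo_to_parquet.py: the helper names (`batched`, `partition_key`, `plan_writes`), the Hive-style `event=.../date=...` layout, and the in-memory `seen_ids` set standing in for the manifest are all assumptions made for illustration. The idea is to read documents in bounded batches, skip any `id` already recorded, and group the remainder by partition before writing.

```python
from datetime import datetime
from itertools import islice
from typing import Iterable, Iterator


def batched(docs: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `size` documents without materializing the cursor."""
    it = iter(docs)
    while batch := list(islice(it, size)):
        yield batch


def partition_key(event: str, doc: dict) -> str:
    """Hypothetical partition path derived from the event name and dateCreated."""
    created = datetime.fromisoformat(doc["dateCreated"])
    return f"event={event}/date={created:%Y-%m-%d}"


def plan_writes(
    event: str,
    docs: Iterable[dict],
    seen_ids: set,
    batch_size: int = 1000,
) -> Iterator[dict]:
    """Group not-yet-migrated documents by partition, one batch at a time.

    `seen_ids` plays the role of the manifest (an assumption): ids found in it
    are skipped, so running the migration twice plans no duplicate rows.
    """
    for batch in batched(docs, batch_size):
        groups: dict = {}
        for doc in batch:
            if doc["id"] in seen_ids:
                continue
            seen_ids.add(doc["id"])
            groups.setdefault(partition_key(event, doc), []).append(doc)
        if groups:
            yield groups
```

In a real run, each yielded group would be handed to a Parquet writer and the manifest persisted afterwards. Keeping the planning step free of I/O makes the deduplication logic easy to test in isolation.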

Testing

  • python -m compileall scripts/migrate_mongo_to_parquet.py src/api.py
  • python - <<'PY'
    from datetime import datetime, timezone
    from pathlib import Path
    from tempfile import TemporaryDirectory
    from unittest.mock import patch

    import mongomock
    from click.testing import CliRunner

    from scripts.migrate_mongo_to_parquet import main as migrate
    from src import api

    # Seed an in-memory MongoDB with three events per collection.
    client = mongomock.MongoClient()
    db = client["fmriprep_stats"]
    now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)

    for event in api.ISSUES:
        coll = db[event]
        for i in range(3):
            coll.insert_one(
                {
                    "_id": f"mongo-{event}-{i}",
                    "id": f"{event}-{i}",
                    "dateCreated": now.isoformat(),
                    "tags": [
                        {"key": "environment", "value": "prod"},
                        {"key": "platform", "value": "linux"},
                    ],
                }
            )

    runner = CliRunner()

    with TemporaryDirectory() as tmpdir:
        dataset_root = Path(tmpdir) / "dataset"
        args = [
            "--mongo-uri", "mongodb://localhost",
            "--db-name", "fmriprep_stats",
            str(dataset_root),
        ]

        # Run the migration twice; the second run must not duplicate rows.
        with patch("scripts.migrate_mongo_to_parquet.MongoClient", lambda uri: client):
            result1 = runner.invoke(migrate, args)
            result2 = runner.invoke(migrate, args)

        assert result1.exit_code == 0, result1.output
        assert result2.exit_code == 0, result2.output

        # Inspect the dataset while the temporary directory still exists.
        manifest_path = api._manifest_path(dataset_root)
        manifest = api._load_manifest(manifest_path)
        assert len(manifest) == len(api.ISSUES) * 3
        assert manifest["id"].nunique() == len(api.ISSUES) * 3

        files = list(dataset_root.rglob("*.parquet"))
        print("files", len(files))
        print("rows", len(manifest))
    PY


https://chatgpt.com/codex/tasks/task_e_68db981765988330bb8574b8d28c19ce


@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting

@oesteban force-pushed the codex/create-mongodb-to-parquet-migration-script branch from 9d28de4 to f802164 on November 18, 2025, 16:05