BREAKING CHANGES: Update ATLAS (and NOMAD) scrapers#72
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a new ATLAS scraper for the MDverse scrapers project and refactors the NOMAD scraper to use a generalized connection testing function. The PR also updates the DOI validation pattern to support DOIs containing forward slashes (required for the ATLAS DOI).
Changes:
- Added new ATLAS scraper with HTML parsing to extract dataset and file metadata from the ATLAS database
- Refactored connection testing by moving
is_nomad_connection_workingto a genericis_connection_to_server_workingfunction - Updated DOI regex pattern to allow forward slashes in addition to existing allowed characters
- Enhanced logger configuration in both ATLAS and NOMAD scrapers to respect debug mode
- Removed old standalone ATLAS scraping script and updated documentation
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/mdverse_scrapers/scrapers/atlas.py | New scraper implementation for ATLAS database with functions to extract PDB chains, scrape metadata, and parse file information from HTML |
| src/mdverse_scrapers/scrapers/nomad.py | Refactored to use generic connection testing function and improved logger initialization based on debug mode |
| src/mdverse_scrapers/core/network.py | Added generic is_connection_to_server_working function to replace NOMAD-specific implementation |
| src/mdverse_scrapers/models/dataset.py | Updated DOI pattern to include forward slash character, allowing DOIs like 10.1093/nar/gkad1084 |
| scripts/scrap_atlas.py | Removed old standalone ATLAS scraper script in favor of integrated scraper module |
| pyproject.toml | Added project metadata (license, authors, classifiers) and registered new scrape-atlas command |
| docs/atlas.md | Updated documentation with clearer structure and comprehensive API endpoint descriptions |
| docs/zenodo.md | Reorganized documentation for better clarity by moving base URL to API section |
| .gitignore | Removed test-related directory entries |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logger.info("Starting ATLAS data scraping...") | ||
| # Create HTTPX client | ||
| client = create_httpx_client() | ||
| # Check connection to NOMAD API |
There was a problem hiding this comment.
The comment says "Check connection to NOMAD API" but this is the ATLAS scraper, so it should say "Check connection to ATLAS API" instead.
| # Check connection to NOMAD API | |
| # Check connection to ATLAS API |
| { | ||
| "file_name": Path(href).name, | ||
| "file_url_in_repository": href, | ||
| # File size are sometimes expressed with comma as decimal separator. |
There was a problem hiding this comment.
The comment should use plural form: "File sizes are sometimes expressed" instead of "File size are sometimes expressed".
| # File size are sometimes expressed with comma as decimal separator. | |
| # File sizes are sometimes expressed with comma as decimal separator. |
| set[str] | ||
| List of PDB chain identifiers found. |
There was a problem hiding this comment.
The docstring describes the return value as "List of PDB chain identifiers found" but the function returns a set[str], not a list. The description should say "Set of PDB chain identifiers found" for consistency.
src/mdverse_scrapers/core/network.py
Outdated
|
|
||
| def is_connection_to_server_working( | ||
| client: httpx.Client, url: str, logger: "loguru.Logger" = loguru.logger | ||
| ) -> bool | None: |
There was a problem hiding this comment.
The return type annotation indicates bool | None, but the function only returns True or False and never returns None. The return type should be bool instead.
| ) -> bool | None: | |
| ) -> bool: |
| set[str] | ||
| List of PDB chains (datasets) found. |
There was a problem hiding this comment.
The docstring describes the return value as "List of PDB chains (datasets) found" but the function returns a set[str], not a list. The description should say "Set of PDB chains (datasets) found" for consistency.
| try: | ||
| meta_json = response.json().get(f"{chain_id}") | ||
| except (json.decoder.JSONDecodeError, KeyError) as exc: | ||
| logger.warning("Failed to deconde JSON response fromthe ATLAS API.") |
There was a problem hiding this comment.
There are two spelling errors in this log message: "deconde" should be "decode" and "fromthe" should be "from the".
| logger.warning("Failed to deconde JSON response fromthe ATLAS API.") | |
| logger.warning("Failed to decode JSON response from the ATLAS API.") |
No description provided.