Commit fb490c4

Merge pull request #208 from GeoinformationSystems/feature/oai_pmh_harvesting
Added Scythe and updated test case
2 parents 6475373 + 8705efa commit fb490c4

16 files changed: +2009 -277 lines changed

.claude/settings.local.json

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(tee:*)",
+      "Bash(git checkout:*)",
+      "Bash(pip install:*)",
+      "Bash(gh issue view:*)",
+      "Bash(pytest:*)",
+      "Bash(pip search:*)",
+      "Bash(psql:*)"
+    ],
+    "deny": [],
+    "ask": []
+  }
+}

CHANGELOG.md

Lines changed: 55 additions & 6 deletions
@@ -4,27 +4,76 @@

 ### Added

-- ...
+- **RSS/Atom feed harvesting support** (`publications/tasks.py`)
+  - `parse_rss_feed_and_save_publications()` function for parsing RSS/Atom feeds
+  - `harvest_rss_endpoint()` function for complete RSS harvesting workflow
+  - Support for RDF-based RSS feeds (Scientific Data journal)
+  - DOI extraction from multiple feed fields (prism:doi, dc:identifier)
+  - Duplicate detection by DOI and URL
+  - Abstract/description extraction from feed content
+- feedparser library integration (v6.0.12)
+  - Added to requirements.txt for RSS/Atom feed parsing
+  - Supports RSS 1.0/2.0, Atom, and RDF feeds
+- Django management command `harvest_journals` enhanced for RSS/Atom feeds
+  - Added Scientific Data journal with RSS feed support
+  - Support for both OAI-PMH and RSS/Atom feed types
+  - Automatic feed type detection based on journal configuration
+  - Now supports 4 journals: ESSD, AGILE-GISS, GEO-LEO (OAI-PMH), Scientific Data (RSS)
+- Comprehensive RSS harvesting tests (`RSSFeedHarvestingTests`)
+  - 7 test cases covering RSS parsing, duplicate detection, error handling
+  - Test fixture with sample RDF/RSS feed (`tests/harvesting/rss_feed_sample.xml`)
+  - Tests for max_records limit, invalid feeds, and HTTP errors
+- Django management command `harvest_journals` for harvesting real journal sources
+  - Command-line options for journal selection, record limits, and source creation
+  - Detailed progress reporting with colored output
+  - Statistics for spatial/temporal metadata extraction
+- Integration tests for real journal harvesting (`tests/test_real_harvesting.py`)
+  - 6 tests covering ESSD, AGILE-GISS, GEO-LEO, and EssOAr
+  - Tests skipped by default (use `SKIP_REAL_HARVESTING=0` to enable)
+  - Max records parameter to limit harvesting for testing
+- Comprehensive error handling tests for OAI-PMH harvesting (`HarvestingErrorTests`)
+  - 10 test cases covering malformed XML, missing metadata, HTTP errors, network timeouts
+  - Test fixtures for various error conditions in `tests/harvesting/error_cases/`
+  - Verification of graceful error handling and logging
+- pytest configuration with custom markers (`pytest.ini`)
+  - `real_harvesting` marker for integration tests
+  - Configuration for Django test discovery

 ### Changed

-- ...
+- Fixed OAI-PMH harvesting test failures by updating response format parameters
+  - Changed from invalid 'structured'/'raw' to valid 'geojson'/'wkt'/'wkb' formats
+  - Updated test assertions to expect GeoJSON FeatureCollection
+- Fixed syntax errors in `publications/tasks.py`
+  - Fixed import statement typo
+  - Fixed indentation in `extract_timeperiod_from_html` function
+  - Fixed misplaced return statement in `regenerate_geopackage_cache` function
+- Fixed test setup method in `tests/test_harvesting.py`
+  - Removed incorrect `@classmethod` decorator from `setUp` method
+- Fixed `test_regular_harvesting.py` to include `max_records` parameter in mock function
+- Updated README.md with comprehensive documentation for:
+  - Integration test execution
+  - `harvest_journals` management command usage
+  - Journal harvesting workflows

 ### Fixed

-- ...
+- Docker build for geoextent installation (added git dependency to Dockerfile)
+- 18 geoextent API test failures due to invalid response format values
+- 8 test setup errors in OAI-PMH harvesting tests
+- Test harvesting function signature mismatch

 ### Deprecated

-- ...
+- None.

 ### Removed

-- ...
+- None.

 ### Security

-- ...
+- None.

 ## [0.2.0] - 2025-10-09

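The changelog entries on DOI extraction and duplicate detection can be illustrated with a minimal sketch. This is not the code in `publications/tasks.py`: entries are plain dictionaries standing in for parsed feed items, and the key names `prism_doi`, `dc_identifier`, and `link`, the DOI pattern, and the helpers `extract_doi` and `is_duplicate` are all hypothetical.

```python
import re

# Loose DOI pattern ("10.", a registrant code, "/", a suffix); an assumption for illustration.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(entry):
    """Return the first DOI found in the candidate fields, or None."""
    # Hypothetical key names standing in for prism:doi and dc:identifier.
    for key in ("prism_doi", "dc_identifier"):
        value = entry.get(key) or ""
        match = DOI_PATTERN.search(value)
        if match:
            return match.group(0)
    return None


def is_duplicate(entry, known_dois, known_urls):
    """Treat an entry as a duplicate if its DOI or its URL is already known."""
    doi = extract_doi(entry)
    url = entry.get("link")
    return (doi is not None and doi in known_dois) or (url is not None and url in known_urls)


entries = [
    {"prism_doi": "10.1234/example.1", "link": "https://example.org/a"},
    {"dc_identifier": "doi:10.1234/example.2", "link": "https://example.org/b"},
    {"link": "https://example.org/a"},  # no DOI, but same URL as the first entry
]

known_dois, known_urls, new_entries = set(), set(), []
for entry in entries:
    if is_duplicate(entry, known_dois, known_urls):
        continue
    new_entries.append(entry)
    doi = extract_doi(entry)
    if doi:
        known_dois.add(doi)
    if entry.get("link"):
        known_urls.add(entry["link"])

print(len(new_entries))
```

Checking the DOI first and the URL second means an entry without a DOI can still be recognised as already harvested, which is presumably why the changelog lists both criteria.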
README.md

Lines changed: 104 additions & 0 deletions
@@ -144,6 +144,11 @@ python manage.py qcluster
 # If you want to use the predefined feeds for continents and oceans we need to load the geometries for global regions
 python manage.py load_global_regions

+# Harvest publications from real OAI-PMH journal sources
+python manage.py harvest_journals --list # List available journals
+python manage.py harvest_journals --all --max-records 20 # Harvest all journals (limited to 20 records each)
+python manage.py harvest_journals --journal essd --journal geo-leo # Harvest specific journals
+
 # Start the Django development server
 python manage.py runserver

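The command-line options documented in this commit's README changes can be sketched with plain `argparse`, which Django management commands also build on. The flag names are taken from the README; the parser layout, the defaults, and the `build_parser` helper are assumptions, not the command's actual code.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented harvest_journals flags (layout assumed)."""
    parser = argparse.ArgumentParser(prog="harvest_journals")
    parser.add_argument("--list", action="store_true", help="List available journals")
    parser.add_argument("--all", action="store_true", help="Harvest all configured journals")
    # action="append" lets --journal be repeated, as in the README examples.
    parser.add_argument("--journal", action="append", default=[], help="Journal key, repeatable")
    parser.add_argument("--max-records", type=int, default=None, help="Per-journal record limit")
    parser.add_argument("--create-sources", action="store_true", help="Create source entries")
    parser.add_argument("--user-email", default=None, help="User to associate with the harvest")
    return parser


args = build_parser().parse_args(
    ["--journal", "essd", "--journal", "geo-leo", "--max-records", "20"]
)
print(args.journal, args.max_records)
```

In a Django command the same `add_argument` calls would typically live in the command's `add_arguments(self, parser)` method.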
@@ -233,6 +238,66 @@ OPTIMAP_EMAIL_PORT=5587

 Visit the URL - http://127.0.0.1:8000/articles/links/

+### Harvest Publications from Real Journals
+
+The `harvest_journals` management command allows you to harvest publications from real OAI-PMH journal sources directly into your database. This is useful for:
+
+- Populating your database with real data for testing and development
+- Testing harvesting functionality against live endpoints
+- Initial data loading for production deployment
+
+**List available journals**:
+
+```bash
+python manage.py harvest_journals --list
+```
+
+**Harvest all configured journals** (with record limit):
+
+```bash
+python manage.py harvest_journals --all --max-records 50
+```
+
+**Harvest specific journals**:
+
+```bash
+# Single journal
+python manage.py harvest_journals --journal essd --max-records 100
+
+# Multiple journals
+python manage.py harvest_journals --journal essd --journal geo-leo --journal agile-giss
+```
+
+**Create source entries automatically**:
+
+```bash
+python manage.py harvest_journals --journal essd --create-sources
+```
+
+**Associate with specific user**:
+
+```bash
+python manage.py harvest_journals --all --user-email [email protected]
+```
+
+**Currently configured journals**:
+
+- `essd` - Earth System Science Data (OAI-PMH) ([Issue #59](https://github.com/GeoinformationSystems/optimap/issues/59))
+- `agile-giss` - AGILE-GISS conference series (OAI-PMH) ([Issue #60](https://github.com/GeoinformationSystems/optimap/issues/60))
+- `geo-leo` - GEO-LEO e-docs repository (OAI-PMH) ([Issue #13](https://github.com/GeoinformationSystems/optimap/issues/13))
+- `scientific-data` - Scientific Data (RSS/Atom) ([Issue #58](https://github.com/GeoinformationSystems/optimap/issues/58))
+
+The command supports both OAI-PMH and RSS/Atom feeds, automatically detecting the feed type for each journal.
+
+The command provides detailed progress reporting including:
+
+- Number of publications harvested
+- Harvesting duration
+- Spatial and temporal metadata statistics
+- Success/failure status for each journal
+
+When the command is run multiple times, it only adds publications that are not already in the database; this deduplication is part of the regular harvesting process.
+
 ### Create Superusers/Admin

 Superusers or administrators can be created using the `createsuperuser` command. This user will have access to the Django admin interface.
@@ -265,6 +330,10 @@ UI tests are based on [Helium](https://github.com/mherrmann/selenium-python-heli
 pip install -r requirements-dev.txt
 ```

+#### Unit Tests
+
+Run all unit tests:
+
 ```bash
 python manage.py test tests

@@ -275,6 +344,41 @@ python -Wa manage.py test
 OPTIMAP_LOGGING_LEVEL=WARNING python manage.py test tests
 ```

+#### Integration Tests (Real Harvesting)
+
+Integration tests that harvest from live OAI-PMH endpoints are disabled by default to avoid network dependencies and slow test execution. These tests verify harvesting from real journal sources.
+
+Run all integration tests:
+
+```bash
+# Enable real harvesting tests
+SKIP_REAL_HARVESTING=0 python manage.py test tests.test_real_harvesting
+```
+
+Run a specific journal test:
+
+```bash
+# Test ESSD harvesting
+SKIP_REAL_HARVESTING=0 python manage.py test tests.test_real_harvesting.RealHarvestingTest.test_harvest_essd
+
+# Test GEO-LEO harvesting
+SKIP_REAL_HARVESTING=0 python manage.py test tests.test_real_harvesting.RealHarvestingTest.test_harvest_geo_leo
+```
+
+Show skipped tests (these are skipped by default):
+
+```bash
+# Run with verbose output to see skip reasons
+python manage.py test tests.test_real_harvesting -v 2
+```
+
+**Supported journals**:
+
+- Earth System Science Data (ESSD) - [Issue #59](https://github.com/GeoinformationSystems/optimap/issues/59)
+- AGILE-GISS - [Issue #60](https://github.com/GeoinformationSystems/optimap/issues/60)
+- GEO-LEO e-docs - [Issue #13](https://github.com/GeoinformationSystems/optimap/issues/13)
+- ESS Open Archive (EssOAr) - [Issue #99](https://github.com/GeoinformationSystems/optimap/issues/99) _(endpoint needs confirmation)_
+
 ### Run UI tests

 Running UI tests needs either compose configuration or a manage.py runserver in a separate shell.

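An environment-variable gate like `SKIP_REAL_HARVESTING` can be built on `unittest.skipIf`. The sketch below is an assumption about how such a gate might look, not the contents of `tests/test_real_harvesting.py`; treating every value other than `0` as "skip" and the class name `GatedHarvestingTest` are illustrative choices.

```python
import os
import unittest

# Assumed semantics: tests are skipped unless SKIP_REAL_HARVESTING is exactly "0".
SKIP_REAL = os.environ.get("SKIP_REAL_HARVESTING", "1") != "0"


@unittest.skipIf(SKIP_REAL, "Real harvesting disabled; set SKIP_REAL_HARVESTING=0 to enable")
class GatedHarvestingTest(unittest.TestCase):
    def test_placeholder(self):
        # A real test would harvest a few records from a live endpoint here.
        self.assertTrue(True)


# Run the gated test case programmatically and report how many tests were skipped.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GatedHarvestingTest)
result = unittest.TestResult()
suite.run(result)
print(f"skipped={len(result.skipped)} successful={result.wasSuccessful()}")
```

Defaulting to "skip" keeps the ordinary test run free of network access, which matches the README's stated reason for disabling these tests by default.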
optimap/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
-__version__ = "0.2.0"
+__version__ = "0.3.0"
 VERSION = __version__
