
Commit 3185776

Adds harvesting error handling tests and updates changelog
1 parent 085bfa6 commit 3185776

6 files changed: +338 -7 lines changed


CHANGELOG.md

Lines changed: 37 additions & 6 deletions
@@ -4,27 +4,58 @@
 
 ### Added
 
-- ...
+- Django management command `harvest_journals` for harvesting real OAI-PMH journal sources
+  - Support for ESSD, AGILE-GISS, and GEO-LEO journals
+  - Command-line options for journal selection, record limits, and source creation
+  - Detailed progress reporting with colored output
+  - Statistics for spatial/temporal metadata extraction
+- Integration tests for real journal harvesting (`tests/test_real_harvesting.py`)
+  - 6 tests covering ESSD, AGILE-GISS, GEO-LEO, and EssOAr
+  - Tests skipped by default (use `SKIP_REAL_HARVESTING=0` to enable)
+  - Max records parameter to limit harvesting for testing
+- Comprehensive error handling tests for OAI-PMH harvesting (`HarvestingErrorTests`)
+  - 10 test cases covering malformed XML, missing metadata, HTTP errors, network timeouts
+  - Test fixtures for various error conditions in `tests/harvesting/error_cases/`
+  - Verification of graceful error handling and logging
+- pytest configuration with custom markers (`pytest.ini`)
+  - `real_harvesting` marker for integration tests
+  - Configuration for Django test discovery
 
 ### Changed
 
-- ...
+- Fixed OAI-PMH harvesting test failures by updating response format parameters
+  - Changed from invalid 'structured'/'raw' to valid 'geojson'/'wkt'/'wkb' formats
+  - Updated test assertions to expect GeoJSON FeatureCollection
+- Fixed syntax errors in `publications/tasks.py`
+  - Fixed import statement typo
+  - Fixed indentation in `extract_timeperiod_from_html` function
+  - Fixed misplaced return statement in `regenerate_geopackage_cache` function
+- Fixed test setup method in `tests/test_harvesting.py`
+  - Removed incorrect `@classmethod` decorator from `setUp` method
+- Fixed `test_regular_harvesting.py` to include `max_records` parameter in mock function
+- Updated README.md with comprehensive documentation for:
+  - Integration test execution
+  - `harvest_journals` management command usage
+  - Journal harvesting workflows
 
 ### Fixed
 
-- ...
+- Docker build for geoextent installation (added git dependency to Dockerfile)
+- 18 geoextent API test failures due to invalid response format values
+- 8 test setup errors in OAI-PMH harvesting tests
+- Test harvesting function signature mismatch
 
 ### Deprecated
 
-- ...
+- None.
 
 ### Removed
 
-- ...
+- None.
 
 ### Security
 
-- ...
+- None.
 
 ## [0.2.0] - 2025-10-09
 
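
The changelog entry above says the real-harvesting tests are skipped by default and enabled with `SKIP_REAL_HARVESTING=0`. The gating code itself is not part of this diff, so the following is only a minimal sketch of how such an opt-in switch could look with the standard `unittest` module. The helper `real_harvesting_skipped` and the class `RealHarvestingSketch` are hypothetical names, not taken from the repository.

```python
import os
import unittest


def real_harvesting_skipped(environ=None):
    """Return True unless SKIP_REAL_HARVESTING is explicitly set to "0".

    Hypothetical helper: the project's actual check may differ.
    """
    environ = os.environ if environ is None else environ
    return environ.get("SKIP_REAL_HARVESTING", "1") != "0"


class RealHarvestingSketch(unittest.TestCase):
    """Illustrative test case that only runs when the switch is turned off."""

    @unittest.skipIf(real_harvesting_skipped(), "set SKIP_REAL_HARVESTING=0 to run")
    def test_live_endpoint(self):
        # A real test would contact a live OAI-PMH endpoint here.
        self.assertTrue(True)
```

Defaulting to "skip" keeps routine test runs independent of network access; only an explicit `0` opts in.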

tests/harvesting/error_cases/empty_response.xml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
+  <responseDate>2022-07-04T15:37:56Z</responseDate>
+  <request verb="ListRecords" metadataPrefix="oai_dc">http://localhost:8330/index.php/opti-geo/oai</request>
+  <ListRecords>
+    <!-- No records -->
+  </ListRecords>
+</OAI-PMH>
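
The fixture above is a well-formed OAI-PMH envelope whose `ListRecords` element contains only a comment. As a rough illustration of why a parser should report zero records for it without raising, here is a small sketch using the standard library's `xml.etree.ElementTree`. The function `count_records` is a hypothetical helper and not the project's `parse_oai_xml_and_save_publications`.

```python
import xml.etree.ElementTree as ET

# Namespace map for the OAI-PMH 2.0 envelope used in the fixture.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

EMPTY_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <!-- No records -->
  </ListRecords>
</OAI-PMH>"""


def count_records(xml_bytes):
    """Count <record> elements anywhere below the document root."""
    root = ET.fromstring(xml_bytes)
    return len(root.findall(".//oai:record", OAI_NS))


print(count_records(EMPTY_RESPONSE))
```

The default parser discards comments, so the comment inside `ListRecords` does not show up as a child element.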
tests/harvesting/error_cases/invalid_xml_structure.xml

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<NotOAIPMH>
+  <SomeRandomElement>This is not a valid OAI-PMH response</SomeRandomElement>
+  <records>
+    <item>Invalid structure</item>
+  </records>
+</NotOAIPMH>
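
This fixture parses as XML but is not an OAI-PMH document. One simple way to tell the two cases apart is to compare the namespaced root tag, sketched below with the standard library. `is_oai_pmh` is a hypothetical helper introduced only for illustration; the diff does not show how the project itself detects this case.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced tags in "{namespace}localname" form.
OAI_ROOT_TAG = "{http://www.openarchives.org/OAI/2.0/}OAI-PMH"

NOT_OAI = b"<NotOAIPMH><records><item>Invalid structure</item></records></NotOAIPMH>"
VALID = b'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords/></OAI-PMH>'


def is_oai_pmh(xml_bytes):
    """Return True when the document root is an OAI-PMH 2.0 envelope."""
    root = ET.fromstring(xml_bytes)
    return root.tag == OAI_ROOT_TAG


print(is_oai_pmh(NOT_OAI), is_oai_pmh(VALID))
```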
tests/harvesting/error_cases/malformed_xml.xml

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
+  <responseDate>2022-07-04T15:37:56Z</responseDate>
+  <request verb="ListRecords" metadataPrefix="oai_dc">http://localhost:8330/index.php/opti-geo/oai</request>
+  <ListRecords>
+    <record>
+      <header>
+        <identifier>oai:ojs2.localhost:8330:article/1</identifier>
+        <datestamp>2022-07-01T12:59:33Z</datestamp>
+      </header>
+      <metadata>
+        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
+          <dc:title>Malformed Record</dc:title>
+          <!-- Missing closing tag for dc:title -->
+        </oai_dc:dc>
+      </metadata>
+    <!-- Missing closing tag for record -->
+  </ListRecords>
+<!-- Missing closing tag for OAI-PMH -->
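
The fixture above is not well-formed: the `record` and `OAI-PMH` elements are never closed, and the `dc` prefix is used without being declared. A strict XML parser rejects such input, so "handled gracefully" implies catching the parse error somewhere. The sketch below shows one way to do that with the standard library. `parse_or_none` and the logger name are hypothetical, and the project's own handling may differ.

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger("harvesting_sketch")  # hypothetical logger name

# Shortened stand-in for the fixture: <record> is opened but never closed.
TRUNCATED = b"<OAI-PMH><ListRecords><record></ListRecords>"


def parse_or_none(xml_bytes):
    """Parse XML, returning None (and logging) instead of raising on bad input."""
    if not xml_bytes or not xml_bytes.strip():
        logger.warning("Empty harvest response")
        return None
    try:
        return ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        logger.error("Could not parse harvest response: %s", exc)
        return None


print(parse_or_none(TRUNCATED))
```

Checking for empty or whitespace-only input first gives a clearer log message than the generic parse error such input would otherwise produce.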
tests/harvesting/error_cases/missing_metadata.xml

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
+  <responseDate>2022-07-04T15:37:56Z</responseDate>
+  <request verb="ListRecords" metadataPrefix="oai_dc">http://localhost:8330/index.php/opti-geo/oai</request>
+  <ListRecords>
+    <record>
+      <header>
+        <identifier>oai:ojs2.localhost:8330:article/1</identifier>
+        <datestamp>2022-07-01T12:59:33Z</datestamp>
+      </header>
+      <metadata>
+        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
+                   xmlns:dc="http://purl.org/dc/elements/1.1/">
+          <!-- Missing required title -->
+          <dc:identifier>http://example.com/article/1</dc:identifier>
+          <dc:description>A publication with no title</dc:description>
+          <dc:date>2022-07-01</dc:date>
+        </oai_dc:dc>
+      </metadata>
+    </record>
+    <record>
+      <header>
+        <identifier>oai:ojs2.localhost:8330:article/2</identifier>
+        <datestamp>2022-07-01T12:59:33Z</datestamp>
+      </header>
+      <metadata>
+        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
+                   xmlns:dc="http://purl.org/dc/elements/1.1/">
+          <dc:title>Record with minimal metadata</dc:title>
+          <dc:identifier>http://example.com/article/2</dc:identifier>
+          <!-- Missing date, description, etc -->
+        </oai_dc:dc>
+      </metadata>
+    </record>
+  </ListRecords>
+</OAI-PMH>
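
Both records in this fixture are well-formed but incomplete: the first has no `dc:title`, the second has no date or description. To illustrate how optional Dublin Core fields can be read without raising, here is a sketch built on `Element.findtext`, which returns a default when the element is absent. `extract_fields` and the trimmed `SAMPLE` document are illustrative assumptions, not the project's extraction code.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed version of the fixture: record 1 lacks a title, record 2 lacks a date.
SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record><metadata><oai_dc:dc>
      <dc:identifier>http://example.com/article/1</dc:identifier>
      <dc:date>2022-07-01</dc:date>
    </oai_dc:dc></metadata></record>
    <record><metadata><oai_dc:dc>
      <dc:title>Record with minimal metadata</dc:title>
      <dc:identifier>http://example.com/article/2</dc:identifier>
    </oai_dc:dc></metadata></record>
  </ListRecords>
</OAI-PMH>"""


def extract_fields(xml_bytes):
    """Return one dict per record; absent fields come back as None."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for record in root.findall(".//oai:record", NS):
        rows.append({
            "title": record.findtext(".//dc:title", default=None, namespaces=NS),
            "identifier": record.findtext(".//dc:identifier", default=None, namespaces=NS),
            "date": record.findtext(".//dc:date", default=None, namespaces=NS),
        })
    return rows


for row in extract_fields(SAMPLE):
    print(row)
```

Whether a record without a title should be saved or skipped is a policy decision for the caller; the sketch only shows that the missing field can be detected cleanly.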

tests/test_harvesting.py

Lines changed: 231 additions & 1 deletion
@@ -10,7 +10,7 @@
 django.setup()
 
 from publications.models import Publication, Source, HarvestingEvent, Schedule
-from publications.tasks import parse_oai_xml_and_save_publications
+from publications.tasks import parse_oai_xml_and_save_publications, harvest_oai_endpoint
 from django.contrib.auth import get_user_model
 
 User = get_user_model()
@@ -310,3 +310,233 @@ def test_real_journal_harvesting_agile_giss(self):
             # Skip test if AGILE doesn't have OAI-PMH endpoint
             self.skipTest(f"AGILE-GISS endpoint not available: {e}")
 
+
+class HarvestingErrorTests(TestCase):
+    """
+    Test cases for error handling during harvesting.
+
+    These tests verify that the harvesting system properly handles:
+    - Malformed XML
+    - Empty responses
+    - Missing required metadata
+    - Invalid XML structure
+    - Network/HTTP errors
+    """
+
+    def setUp(self):
+        """Set up test sources and events."""
+        Publication.objects.all().delete()
+        self.source = Source.objects.create(
+            url_field="http://example.com/oai",
+            harvest_interval_minutes=60,
+            name="Error Test Source"
+        )
+
+    def test_malformed_xml(self):
+        """Test that malformed XML is handled gracefully."""
+        event = HarvestingEvent.objects.create(
+            source=self.source,
+            status="in_progress"
+        )
+
+        malformed_xml_path = BASE_TEST_DIR / 'harvesting' / 'error_cases' / 'malformed_xml.xml'
+        xml_bytes = malformed_xml_path.read_bytes()
+
+        # Should not raise exception, but should log error
+        parse_oai_xml_and_save_publications(xml_bytes, event)
+
+        # No publications should be created from malformed XML
+        pub_count = Publication.objects.filter(job=event).count()
+        self.assertEqual(pub_count, 0, "Malformed XML should not create publications")
+
+    def test_empty_response(self):
+        """Test that empty OAI-PMH response (no records) is handled."""
+        event = HarvestingEvent.objects.create(
+            source=self.source,
+            status="in_progress"
+        )
+
+        empty_xml_path = BASE_TEST_DIR / 'harvesting' / 'error_cases' / 'empty_response.xml'
+        xml_bytes = empty_xml_path.read_bytes()
+
+        # Should not raise exception
+        parse_oai_xml_and_save_publications(xml_bytes, event)
+
+        # No publications should be created from empty response
+        pub_count = Publication.objects.filter(job=event).count()
+        self.assertEqual(pub_count, 0, "Empty response should create zero publications")
+
+    def test_invalid_xml_structure(self):
+        """Test that non-OAI-PMH XML structure is handled."""
+        event = HarvestingEvent.objects.create(
+            source=self.source,
+            status="in_progress"
+        )
+
+        invalid_xml_path = BASE_TEST_DIR / 'harvesting' / 'error_cases' / 'invalid_xml_structure.xml'
+        xml_bytes = invalid_xml_path.read_bytes()
+
+        # Should not raise exception
+        parse_oai_xml_and_save_publications(xml_bytes, event)
+
+        # No publications should be created from invalid structure
+        pub_count = Publication.objects.filter(job=event).count()
+        self.assertEqual(pub_count, 0, "Invalid XML structure should create zero publications")
+
+    def test_missing_required_metadata(self):
+        """Test that records with missing required fields are handled."""
+        event = HarvestingEvent.objects.create(
+            source=self.source,
+            status="in_progress"
+        )
+
+        missing_metadata_path = BASE_TEST_DIR / 'harvesting' / 'error_cases' / 'missing_metadata.xml'
+        xml_bytes = missing_metadata_path.read_bytes()
+
+        # Should not raise exception - may create some publications
+        parse_oai_xml_and_save_publications(xml_bytes, event)
+
+        # Check what was created
+        pubs = Publication.objects.filter(job=event)
+
+        # At least one record (the one with title) should be created
+        self.assertGreaterEqual(pubs.count(), 1, "Should create publications even with minimal metadata")
+
+        # Check that publications were created despite missing fields
+        for pub in pubs:
+            # Title might be None for some records
+            if pub.title:
+                self.assertIsInstance(pub.title, str)
+
+    def test_empty_content(self):
+        """Test that empty/None content is handled."""
+        event = HarvestingEvent.objects.create(
+            source=self.source,
+            status="in_progress"
+        )
+
+        # Test with empty bytes
+        parse_oai_xml_and_save_publications(b"", event)
+        pub_count = Publication.objects.filter(job=event).count()
+        self.assertEqual(pub_count, 0, "Empty content should create zero publications")
+
+        # Test with whitespace only
+        parse_oai_xml_and_save_publications(b" \n\t ", event)
+        pub_count = Publication.objects.filter(job=event).count()
+        self.assertEqual(pub_count, 0, "Whitespace-only content should create zero publications")
+
+    @responses.activate
+    def test_http_404_error(self):
+        """Test that HTTP 404 errors are handled properly."""
+        # Mock a 404 response
+        responses.add(
+            responses.GET,
+            'http://example.com/oai-404',
+            status=404,
+            body='Not Found'
+        )
+
+        source = Source.objects.create(
+            url_field="http://example.com/oai-404",
+            harvest_interval_minutes=60
+        )
+
+        # harvest_oai_endpoint should handle the error
+        harvest_oai_endpoint(source.id)
+
+        # Check that event was marked as failed
+        event = HarvestingEvent.objects.filter(source=source).latest('started_at')
+        self.assertEqual(event.status, 'failed', "Event should be marked as failed for 404 error")
+
+    @responses.activate
+    def test_http_500_error(self):
+        """Test that HTTP 500 errors are handled properly."""
+        # Mock a 500 response
+        responses.add(
+            responses.GET,
+            'http://example.com/oai-500',
+            status=500,
+            body='Internal Server Error'
+        )
+
+        source = Source.objects.create(
+            url_field="http://example.com/oai-500",
+            harvest_interval_minutes=60
+        )
+
+        # harvest_oai_endpoint should handle the error
+        harvest_oai_endpoint(source.id)
+
+        # Check that event was marked as failed
+        event = HarvestingEvent.objects.filter(source=source).latest('started_at')
+        self.assertEqual(event.status, 'failed', "Event should be marked as failed for 500 error")
+
+    @responses.activate
+    def test_network_timeout(self):
+        """Test that network timeouts are handled properly."""
+        from requests.exceptions import Timeout
+
+        # Mock a timeout
+        responses.add(
+            responses.GET,
+            'http://example.com/oai-timeout',
+            body=Timeout('Connection timeout')
+        )
+
+        source = Source.objects.create(
+            url_field="http://example.com/oai-timeout",
+            harvest_interval_minutes=60
+        )
+
+        # harvest_oai_endpoint should handle the timeout
+        harvest_oai_endpoint(source.id)
+
+        # Check that event was marked as failed
+        event = HarvestingEvent.objects.filter(source=source).latest('started_at')
+        self.assertEqual(event.status, 'failed', "Event should be marked as failed for timeout")
+
+    @responses.activate
+    def test_invalid_xml_in_http_response(self):
+        """Test that invalid XML in HTTP response is handled."""
+        # Mock response with invalid XML
+        responses.add(
+            responses.GET,
+            'http://example.com/oai-invalid',
+            status=200,
+            body='This is not XML at all',
+            content_type='text/xml'
+        )
+
+        source = Source.objects.create(
+            url_field="http://example.com/oai-invalid",
+            harvest_interval_minutes=60
+        )
+
+        # Should complete but create no publications
+        harvest_oai_endpoint(source.id)
+
+        event = HarvestingEvent.objects.filter(source=source).latest('started_at')
+        # Should complete (not fail) but create no publications
+        self.assertEqual(event.status, 'completed', "Event should complete even with invalid XML")
+
+        pub_count = Publication.objects.filter(job=event).count()
+        self.assertEqual(pub_count, 0, "Invalid XML should create zero publications")
+
+    def test_max_records_limit_with_errors(self):
+        """Test that max_records works even when some records cause errors."""
+        event = HarvestingEvent.objects.create(
+            source=self.source,
+            status="in_progress"
+        )
+
+        # Use the missing metadata file which has 2 records, one problematic
+        missing_metadata_path = BASE_TEST_DIR / 'harvesting' / 'error_cases' / 'missing_metadata.xml'
+        xml_bytes = missing_metadata_path.read_bytes()
+
+        # Limit to 1 record
+        parse_oai_xml_and_save_publications(xml_bytes, event, max_records=1)
+
+        # Should process only 1 record
+        pub_count = Publication.objects.filter(job=event).count()
+        self.assertLessEqual(pub_count, 1, "Should respect max_records limit even with errors")
+
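
The HTTP tests above encode a distinction worth spelling out: transport failures (404, 500, timeout) are expected to mark the event as `failed`, while a 200 response with an unparseable body is expected to end as `completed` with zero publications. The sketch below models only that decision logic with the standard library. `FetchError`, `run_harvest`, and the injected `fetch` callable are hypothetical stand-ins; the real `harvest_oai_endpoint` is not shown in this diff.

```python
import xml.etree.ElementTree as ET

RECORD_TAG = "{http://www.openarchives.org/OAI/2.0/}record"


class FetchError(Exception):
    """Hypothetical stand-in for an HTTP error or timeout from the fetch step."""


def run_harvest(fetch):
    """Return (status, record_count) for one harvest attempt.

    `fetch` is any zero-argument callable that returns response bytes
    or raises FetchError.
    """
    try:
        body = fetch()
    except FetchError:
        # Transport-level problems fail the whole event.
        return "failed", 0
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # An unparseable body yields no records but does not fail the event.
        return "completed", 0
    return "completed", len(root.findall(f".//{RECORD_TAG}"))


def raise_timeout():
    raise FetchError("Connection timeout")


print(run_harvest(raise_timeout))
print(run_harvest(lambda: b"This is not XML at all"))
```

Passing the fetch step in as a callable keeps the sketch free of network access, which is the same goal the `responses` mocks serve in the tests above.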
