Open
Labels
help wanted, needs discussion, needs triage, new feature
Description
Is your feature request related to a problem? Please describe.
DSpace 7+ sites can receive a large volume of unwanted crawler requests against REST API endpoints that are irrelevant for search engines to index; crawlers essentially only need the item pages and bitstreams.
As an example, one affected repository received millions of crawler requests and consumed terabytes of bandwidth over a 90-day period, with requests for (REST API) JSON accounting for more than 50% of all requests.
Describe the solution you'd like (updated proposal, 2025-09-11)
# Crawlers should get minimal access to the REST API, allowing them to crawl item pages with JavaScript and download bitstreams
Allow: /server/api
Allow: /server/api/core/bitstreams/
Allow: /server/api/authn/status
Allow: /server/api/security/csrf
Disallow: /server/api/
Disallow: /server/opensearch/
Disallow: /server/oai/
Improvements
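- More permissive, so that robots can crawl and render JavaScript
- Drops the trailing *: rules already match by path prefix, so it is redundant, and the wildcard was not part of the original robots.txt convention (it is defined in RFC 9309, but not every crawler necessarily supports it). The trailing slash is what makes a rule apply to sub-paths only (very important for core/bitstreams/)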
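To sanity-check the updated rules, here is a minimal Python sketch of the "longest matching rule wins" precedence described in RFC 9309 and in Google's robots.txt documentation. It is an illustration under stated assumptions, not how any particular crawler is implemented: it only handles plain path prefixes (no wildcards, no percent-encoding normalization), the function name `is_allowed` is made up for this example, and the `1234` identifiers are placeholders for real UUIDs.

```python
# Rules from the updated proposal, as (directive, path prefix) pairs.
RULES = [
    ("allow", "/server/api"),
    ("allow", "/server/api/core/bitstreams/"),
    ("allow", "/server/api/authn/status"),
    ("allow", "/server/api/security/csrf"),
    ("disallow", "/server/api/"),
    ("disallow", "/server/opensearch/"),
    ("disallow", "/server/oai/"),
]


def is_allowed(path, rules=RULES):
    """Return True if `path` may be crawled under longest-match precedence.

    Assumptions: the longest matching prefix wins, an allow rule wins a tie,
    and a path that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


if __name__ == "__main__":
    samples = [
        "/server/api",                               # API root
        "/server/api/authn/status",                  # needed to render pages
        "/server/api/core/bitstreams/1234/content",  # bitstream download
        "/server/api/core/items/1234",               # item JSON
        "/server/oai/request",                       # OAI-PMH endpoint
        "/items/1234",                               # UI item page
    ]
    for sample in samples:
        print(sample, "->", "allowed" if is_allowed(sample) else "blocked")
```

Under these assumptions, the API root, authn/status, bitstream downloads and UI item pages come out as allowed, while item JSON and the OAI endpoint come out as blocked. Parsers that apply the first matching rule instead of the longest one may behave differently, because `Allow: /server/api` is listed first.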
Describe the solution you'd like (original proposal, flawed, 2025-06-17)
The following could be added to robots.txt:
Allow: /server/api/core/bitstreams
Disallow: /server/api/*
Disallow: /server/api
Disallow: /server/opensearch/*
Disallow: /server/oai/*
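As a side note on the trailing `*` used above, the sketch below converts a robots.txt path pattern into a regular expression, following the wildcard semantics described in Google's robots.txt documentation (`*` matches any run of characters, a trailing `$` anchors the end of the path). The helper names `pattern_to_regex` and `matches` are made up for this example, and crawlers without wildcard support may treat `*` differently. It suggests that `/server/api/*` and `/server/api/` match the same paths.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches zero or more characters, a trailing '$'
    anchors the end of the path, and everything else is a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern, path):
    """Return True if `path` is covered by the robots.txt `pattern`."""
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    paths = [
        "/server/api",
        "/server/api/",
        "/server/api/core/items/1234",  # placeholder identifier
        "/server/oai/request",
    ]
    for path in paths:
        with_star = matches("/server/api/*", path)
        without_star = matches("/server/api/", path)
        print(path, with_star, without_star)
```

For each sample path the two columns agree, which is why the trailing `*` can be dropped without changing what is blocked for wildcard-aware crawlers.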
Additional context
For reference, see these docs on the use of the Allow directive, confirming that it can be used to effectively override (broader) disallow directives:
https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
Status
👀 Needs Discussion / Analysis