robots.txt: Allow only minimal access to REST API endpoints for crawlers #4481

@bram-atmire

Description

Is your feature request related to a problem? Please describe.

DSpace 7+ sites can receive a large volume of unwanted crawling requests against REST API endpoints that are irrelevant for search engines to index; crawlers essentially only need the item pages and bitstreams.

Below is an example of an affected repository that received millions of crawling requests and terabytes of bandwidth consumption over a 90-day period, where requests for (REST API) JSON account for more than 50% of all requests:

[Screenshots: crawl request volume and bandwidth consumption statistics for the affected repository]

Updated proposal 2025-09-11

Describe the solution you'd like

# Crawlers should get minimal access to the REST API, while still being able to crawl item pages with JavaScript and download bitstreams
Allow: /server/api
Allow: /server/api/core/bitstreams/
Allow: /server/api/authn/status
Allow: /server/api/security/csrf
Disallow: /server/api/
Disallow: /server/opensearch/
Disallow: /server/oai/

Improvements

  • More permissive, so robots can crawl and render the JavaScript application
  • Drops the use of *, as this is not an official operator. The trailing slash is what matters for allowing sub-paths (very important for core/bitstreams/); see the sketch below.
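
To illustrate how these rules resolve, here is a minimal sketch (in Python) of the longest-match precedence described in Google's robots.txt documentation: the most specific matching rule wins, and Allow wins over Disallow on a tie. This is a simplified illustration only, ignoring wildcards and $ anchors, and the sample paths are hypothetical DSpace URLs.

# Minimal sketch of longest-match precedence applied to the proposed rules.
# Simplification: plain prefix matching only (no '*' or '$' handling).

RULES = [
    ("allow", "/server/api"),
    ("allow", "/server/api/core/bitstreams/"),
    ("allow", "/server/api/authn/status"),
    ("allow", "/server/api/security/csrf"),
    ("disallow", "/server/api/"),
    ("disallow", "/server/opensearch/"),
    ("disallow", "/server/oai/"),
]

def is_allowed(path: str) -> bool:
    # Collect every rule whose path is a prefix of the requested path,
    # then pick the longest; True > False breaks ties in favour of Allow.
    matches = [(len(p), kind == "allow") for kind, p in RULES if path.startswith(p)]
    if not matches:
        return True  # no matching rule means the path is allowed
    _, allowed = max(matches)
    return allowed

# Hypothetical example paths on a DSpace 7+ site:
for path in [
    "/server/api",                                  # API root, needed for JavaScript rendering
    "/server/api/core/bitstreams/123-abc/content",  # bitstream download -> allowed
    "/server/api/discover/search/objects",          # search endpoint -> disallowed
    "/server/oai/request",                          # OAI-PMH -> disallowed
]:
    print(path, "->", "allowed" if is_allowed(path) else "disallowed")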

ORIGINAL PROPOSAL (FLAWED) 2025-06-17

Describe the solution you'd like

The following could be added to robots.txt

Allow: /server/api/core/bitstreams
Disallow: /server/api/*
Disallow: /server/api
Disallow: /server/opensearch/*
Disallow: /server/oai/*

Additional context

For reference, see Google's documentation on the use of the Allow directive, which confirms that a more specific Allow rule can effectively override a broader Disallow rule:
https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
