robots.txt: Allow only minimal access to REST API endpoints for crawlers #4481

@bram-atmire

Description

Is your feature request related to a problem? Please describe.

DSpace 7+ sites can receive a large volume of unwanted crawling requests against REST API endpoints that are irrelevant for search engines to index; crawlers essentially only need the item pages and bitstreams.

Below is an example of an affected repository that received millions of crawling requests and terabytes of bandwidth consumption over a 90-day period, where requests for (REST API) JSON account for more than 50% of all requests:

[Screenshots: crawl request volume and bandwidth consumption statistics for the affected repository]

Updated proposal 2025-09-11

Describe the solution you'd like

# Crawlers should get minimal access to the REST API, while still being able to crawl item pages with JavaScript and download bitstreams
Allow: /server/api
Allow: /server/api/core/bitstreams/
Allow: /server/api/authn/status
Allow: /server/api/security/csrf
Disallow: /server/api/
Disallow: /server/opensearch/
Disallow: /server/oai/

Improvements

  • More permissive, so robots can crawl and render the JavaScript application
  • Drops the use of *, as this is not an official operator. The trailing slash is what matters for allowing sub-paths (very important for core/bitstreams/); see the sketch below.
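
To illustrate how these rules resolve, here is a minimal sketch (in Python) of the longest-match precedence described in Google's robots.txt documentation: the most specific matching rule wins, and Allow wins over Disallow on a tie. This is a simplified illustration only, ignoring wildcards and $ anchors, and the sample paths are hypothetical DSpace URLs.

# Minimal sketch of longest-match precedence applied to the proposed rules.
# Simplification: plain prefix matching only (no '*' or '$' handling).

RULES = [
    ("allow", "/server/api"),
    ("allow", "/server/api/core/bitstreams/"),
    ("allow", "/server/api/authn/status"),
    ("allow", "/server/api/security/csrf"),
    ("disallow", "/server/api/"),
    ("disallow", "/server/opensearch/"),
    ("disallow", "/server/oai/"),
]

def is_allowed(path: str) -> bool:
    # Collect every rule whose path is a prefix of the requested path,
    # then pick the longest; True > False breaks ties in favour of Allow.
    matches = [(len(p), kind == "allow") for kind, p in RULES if path.startswith(p)]
    if not matches:
        return True  # no matching rule means the path is allowed
    _, allowed = max(matches)
    return allowed

# Hypothetical example paths on a DSpace 7+ site:
for path in [
    "/server/api",                                  # API root, needed for JavaScript rendering
    "/server/api/core/bitstreams/123-abc/content",  # bitstream download -> allowed
    "/server/api/discover/search/objects",          # search endpoint -> disallowed
    "/server/oai/request",                          # OAI-PMH -> disallowed
]:
    print(path, "->", "allowed" if is_allowed(path) else "disallowed")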

ORIGINAL PROPOSAL (FLAWED) 2025-06-17

Describe the solution you'd like

The following could be added to robots.txt

Allow: /server/api/core/bitstreams
Disallow: /server/api/*
Disallow: /server/api
Disallow: /server/opensearch/*
Disallow: /server/oai/*

Additional context

For reference, see Google's documentation on the use of the Allow directive, which confirms that a more specific Allow rule can effectively override a broader Disallow rule:
https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
