
[Discussion] Provide DSpace tools or better documentation regarding managing aggressive bots / harvesters #4565


Description

@tdonohue

Is your feature request related to a problem? Please describe.

This ticket comes out of a discussion with the Google Scholar team in July 2025. The Google Scholar team is noticing that more and more repositories accidentally block Google Scholar's bots while attempting to alleviate the aggressive behavior of other bots (especially AI bots). The team understands that sites have to find ways to address aggressive bots, but asked if we could brainstorm whether there are tools or documentation we could add to help DSpace sites make better decisions (and hopefully alleviate some of the common bot-related issues).

A few brainstorms include:

  1. Could DSpace improve our documentation around Apache (or similar) to document recommended tools/services/settings that can be used to block or rate-limit aggressive bots? This documentation might be added to our Performance Tuning DSpace guide, but it'd require one or more sites being willing to share how they've successfully dealt with aggressive bots. (A rough Apache sketch follows this list.)
  2. Could DSpace have a basic rate limiter that can be configured to block aggressive IPs or "/24" ranges, or even to filter by geolocation? This wouldn't solve the entire issue, but it might help sites avoid some basic aggressive bot behaviors.
    • This might be an improved "rateLimiter", which already exists in our UI configuration. Our current setup is very basic and uses express-rate-limit. Are there ways to enhance it (whether using the same library or a different one)? See the sketch after this list.
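
As a starting point for the documentation idea in item 1, below is the style of Apache httpd snippet sites commonly use to refuse known aggressive crawlers by User-Agent. This is a minimal sketch, not a recommendation: the bot names are illustrative placeholders, and each site would need to maintain its own list (taking care never to include Googlebot, which Google Scholar crawls with, or other "good bots").

```apache
# Minimal sketch: refuse a few known-aggressive crawlers by User-Agent.
# Requires mod_rewrite. The bot names below are illustrative placeholders.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Bytespider|PetalBot|GPTBot) [NC]
RewriteRule .* - [F,L]
```

Blocking by User-Agent is the bluntest tool available; it only catches bots that identify themselves honestly, which is exactly why pairing it with rate limiting (item 2) matters.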
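For item 2, here is a minimal sketch (in TypeScript) of how an express-rate-limit setup could be extended to count requests per "/24" IPv4 range and to skip allowlisted "good bots". This is not dspace-angular's actual configuration; the window, limit, and allowlist pattern are illustrative assumptions.

```typescript
import express from 'express';
import rateLimit from 'express-rate-limit';

// Hypothetical allowlist; real deployments should verify crawlers via
// reverse DNS, since User-Agent strings are trivially spoofed.
const GOOD_BOTS = /Googlebot|bingbot/i;

const limiter = rateLimit({
  windowMs: 60 * 1000, // 1-minute window
  max: 300,            // requests per key per window (tune per site)
  standardHeaders: true,
  legacyHeaders: false,
  // Never throttle allowlisted crawlers.
  skip: (req) => GOOD_BOTS.test(req.get('user-agent') ?? ''),
  // Count a whole /24 IPv4 range as a single client, so a crawler
  // rotating through a subnet still hits one shared limit. IPv6
  // addresses fall back to the full address here; a fuller version
  // would group those by /64.
  keyGenerator: (req) => {
    const ip = req.ip ?? '';
    const m = ip.match(/^(\d+\.\d+\.\d+)\.\d+$/);
    return m ? `${m[1]}.0/24` : ip;
  },
});

const app = express();
app.set('trust proxy', 1); // so req.ip is the real client IP behind a reverse proxy
app.use(limiter);
```

Grouping by subnet is the interesting part: many aggressive harvesters rotate through adjacent addresses, so per-IP limits never trigger, while a per-/24 key catches them with one shared bucket.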

Describe the solution you'd like

It doesn't matter whether the solution is built into DSpace or is just better documentation/recommendations. The main goal is to provide individual DSpace sites with better advice, hints, and tips, so that they aren't each "reinventing the wheel" (and sometimes accidentally blocking Google Scholar or other "good bots" while doing so).

We can start small and build on that toward better solutions. For example, what is one basic step DSpace sites can take that would have even a small benefit in alleviating the effects of aggressive bots (or blocking some of them)?

Additional information

Related discussions on dealing with aggressive bots include:

Metadata


    Labels

    component: SEO (Search Engine Optimization), documentation (Ticket describes improvements or additions to documentation), help wanted (Needs a volunteer to claim to move forward), new feature, performance / caching (Related to performance, caching or embedded objects)

    Projects

    Status

    πŸ“‹ To Do
