I would like to ask a question related to data acquisition.
Recently, I have been working on a requirement: to batch-collect files, documents, and historical materials published before a specific year from a variety of websites.
At first, I tested some common web crawlers and “one-click download” tools, and found that most of them only support the following workflow:
Given a starting URL
Automatically extract links from the page
Recursively crawl based on a defined depth
Download pages or files encountered along the way
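The workflow above is essentially a breadth-first link crawl. As a point of reference, here is a minimal sketch of that pattern in Python using only the standard library (the depth limit and same-host filter are illustrative choices, not part of any particular tool):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    """Return absolute URLs for every <a href> found in the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(start_url, max_depth=2):
    """Breadth-first crawl from start_url up to max_depth, same host only."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        yield url, html  # a download/save step would hook in here
        if depth < max_depth:
            for link in extract_links(html, url):
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
```

The key limitation is visible in the code itself: `crawl` can only ever reach pages that some already-fetched page links to, which is exactly why it fails on search-driven sites.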
This approach works for “static directory-style” websites, but most of the sites I encounter in practice are not structured this way.
The real issue is:
On many websites, resources are not exposed through a navigable page structure. Instead, they are stored in databases and can only be located by performing on-site searches, applying filters, and navigating through paginated results before reaching the resource detail or download pages. In other words, these resources are not “discovered through link navigation,” but rather “discovered through search behavior.”
A hypothetical example:
On platforms like JSTOR, I can find target resources (e.g., free-access materials) through the site’s search function. However, if I start crawling from the homepage (https://www.jstor.org/) using a conventional crawler, I would never reach those resource pages. In some cases, even the parent paths of these target links are not publicly accessible. This suggests that the entry points to these resources are “exposed after search,” rather than “exposed through web navigation.”
So I would like to discuss the following:
Are there any mature tools or frameworks designed for this type of website?
How do people typically implement similar requirements?
I currently have a few ideas:
Using Playwright / Selenium to simulate human search behavior
RPA / browser automation
Analyzing the website’s internal search APIs
Using AI agents to operate web pages automatically
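For the first idea, a browser-automation sketch with Playwright might look like the following. Every selector (`input[name=q]`, `a.result-link`, `a.next-page`) and the helper below are hypothetical placeholders; a real implementation would substitute the target site's actual form fields and result markup:

```python
from urllib.parse import urlencode

def search_result_pages(base, query, pages):
    """Hypothetical helper: enumerate paginated search-result URLs,
    for sites whose search results are addressable by query string."""
    return [f"{base}?{urlencode({'q': query, 'page': n})}"
            for n in range(1, pages + 1)]

def collect_detail_links(base_url, query, max_pages=3):
    """Drive a real browser through the site's own search form and
    harvest links from the result list. Selectors are placeholders."""
    from playwright.sync_api import sync_playwright  # pip install playwright

    links = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(base_url)
        page.fill("input[name=q]", query)        # hypothetical search box
        page.press("input[name=q]", "Enter")
        for _ in range(max_pages):
            page.wait_for_selector("a.result-link")  # hypothetical result link
            links += [a.get_attribute("href")
                      for a in page.query_selector_all("a.result-link")]
            next_btn = page.query_selector("a.next-page")  # hypothetical
            if not next_btn:
                break
            next_btn.click()
        browser.close()
    return links
```

The collected links can then be handed to a conventional downloader, which matches the observation that discovery, not downloading, is the hard part.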
For sites like academic databases, archives, or government data platforms—where resources are “hidden inside databases”—are there any established best practices to follow?
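One commonly suggested practice for such sites is the third idea above: open the browser's network tab while using the site's search UI, find the JSON endpoint the frontend calls, and page through it directly. A sketch, where the endpoint, parameter names (`q`, `page`, `limit`), and response shape are all assumptions to be replaced with whatever the real API uses:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_search_url(endpoint, query, page, per_page=50):
    """Hypothetical endpoint and parameter names: discover the real ones
    by watching the browser's network tab during a manual search."""
    params = {"q": query, "page": page, "limit": per_page}
    return f"{endpoint}?{urlencode(params)}"

def extract_records(payload):
    """Pull (id, title, url) triples out of a hypothetical JSON response."""
    return [(r["id"], r["title"], r["url"])
            for r in payload.get("results", [])]

def search_all(endpoint, query, max_pages=10):
    """Page through the search API until a page comes back empty."""
    for page in range(1, max_pages + 1):
        with urlopen(build_search_url(endpoint, query, page)) as resp:
            records = extract_records(json.load(resp))
        if not records:
            break
        yield from records
```

When an internal API exists, this approach is usually far more robust than driving the UI, since it skips rendering entirely; the trade-off is that the endpoint is undocumented and may change or be rate-limited or access-controlled.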
I am increasingly convinced that the real challenge in this type of problem is not “downloading,” but “discovery”—that is, how to automate the process of performing on-site searches, filtering results, and locating resources in place of a human.
If anyone has worked on similar projects or is aware of relevant tools, papers, open-source frameworks, or engineering experience, I would greatly appreciate your insights. Thank you.