I would like to ask a question related to data acquisition.
Recently, I have been working on a requirement: to batch-collect files, documents, and historical materials published before a specific year from a variety of websites.
At first, I tested some common web crawlers and “one-click download” tools, and found that most of them only support the following workflow:
Given a starting URL
Automatically extract links from the page
Recursively crawl based on a defined depth
Download pages or files encountered along the way
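The workflow above is essentially a breadth-first link crawl. As a point of reference, here is a minimal sketch of that pattern in Python using only the standard library (the depth limit and same-host filter are illustrative choices, not part of any particular tool):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    """Return absolute URLs for every <a href> found in the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(start_url, max_depth=2):
    """Breadth-first crawl from start_url up to max_depth, same host only."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        yield url, html  # a download/save step would hook in here
        if depth < max_depth:
            for link in extract_links(html, url):
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
```

The key limitation is visible in the code itself: `crawl` can only ever reach pages that some already-fetched page links to, which is exactly why it fails on search-driven sites.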
This approach works for “static directory-style” websites, but most of the sites I encounter in practice are not structured this way.
The real issue is:
On many websites, resources are not exposed through a navigable page structure. Instead, they are stored in databases and can only be located by performing on-site searches, applying filters, and navigating through paginated results before reaching the resource detail or download pages. In other words, these resources are not “discovered through link navigation,” but rather “discovered through search behavior.”
A hypothetical example:
On platforms like JSTOR, I can find target resources (e.g., free-access materials) through the site’s search function. However, if I start crawling from the homepage (https://www.jstor.org/) using a conventional crawler, I would never reach those resource pages. In some cases, even the parent paths of these target links are not publicly accessible. This suggests that the entry points to these resources are “exposed after search,” rather than “exposed through web navigation.”
So I would like to discuss the following:
Are there any mature tools or frameworks designed for this type of website?
How do people typically implement similar requirements?
I currently have a few ideas:
Using Playwright / Selenium to simulate human search behavior
RPA / browser automation
Analyzing the website’s internal search APIs
Using AI agents to operate web pages automatically
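For the first idea, a browser-automation sketch with Playwright might look like the following. Every selector (`input[name=q]`, `a.result-link`, `a.next-page`) and the helper below are hypothetical placeholders; a real implementation would substitute the target site's actual form fields and result markup:

```python
from urllib.parse import urlencode

def search_result_pages(base, query, pages):
    """Hypothetical helper: enumerate paginated search-result URLs,
    for sites whose search results are addressable by query string."""
    return [f"{base}?{urlencode({'q': query, 'page': n})}"
            for n in range(1, pages + 1)]

def collect_detail_links(base_url, query, max_pages=3):
    """Drive a real browser through the site's own search form and
    harvest links from the result list. Selectors are placeholders."""
    from playwright.sync_api import sync_playwright  # pip install playwright

    links = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(base_url)
        page.fill("input[name=q]", query)        # hypothetical search box
        page.press("input[name=q]", "Enter")
        for _ in range(max_pages):
            page.wait_for_selector("a.result-link")  # hypothetical result link
            links += [a.get_attribute("href")
                      for a in page.query_selector_all("a.result-link")]
            next_btn = page.query_selector("a.next-page")  # hypothetical
            if not next_btn:
                break
            next_btn.click()
        browser.close()
    return links
```

The collected links can then be handed to a conventional downloader, which matches the observation that discovery, not downloading, is the hard part.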
For sites like academic databases, archives, or government data platforms—where resources are “hidden inside databases”—are there any established best practices to follow?
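One commonly suggested practice for such sites is the third idea above: open the browser's network tab while using the site's search UI, find the JSON endpoint the frontend calls, and page through it directly. A sketch, where the endpoint, parameter names (`q`, `page`, `limit`), and response shape are all assumptions to be replaced with whatever the real API uses:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_search_url(endpoint, query, page, per_page=50):
    """Hypothetical endpoint and parameter names: discover the real ones
    by watching the browser's network tab during a manual search."""
    params = {"q": query, "page": page, "limit": per_page}
    return f"{endpoint}?{urlencode(params)}"

def extract_records(payload):
    """Pull (id, title, url) triples out of a hypothetical JSON response."""
    return [(r["id"], r["title"], r["url"])
            for r in payload.get("results", [])]

def search_all(endpoint, query, max_pages=10):
    """Page through the search API until a page comes back empty."""
    for page in range(1, max_pages + 1):
        with urlopen(build_search_url(endpoint, query, page)) as resp:
            records = extract_records(json.load(resp))
        if not records:
            break
        yield from records
```

When an internal API exists, this approach is usually far more robust than driving the UI, since it skips rendering entirely; the trade-off is that the endpoint is undocumented and may change or be rate-limited or access-controlled.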
I am increasingly convinced that the real challenge in this type of problem is not “downloading,” but “discovery”—that is, how to automate the process of performing on-site searches, filtering results, and locating resources in place of a human.
If anyone has worked on similar projects or is aware of relevant tools, papers, open-source frameworks, or engineering experience, I would greatly appreciate your insights. Thank you.