Replies: 1 comment
-
Hey @geoffreya, I've done my fair share of scraping, and unless you use a sophisticated scraping service that rotates your IPs and disguises your crawling, you'll get blocked fairly soon. I tried https://www.scraperapi.com/, and it worked well.

1. Yes, you'd need to train the scraping agent to do this.
2. I have minimal experience here; see answer 1.
3. I know that free scrapers are a waste of time.

If I understand you correctly, you are trying to create a service that will provide more truthful results for your e-commerce queries?
-
As everyone can see, Amazon returns a whole lot of junk search results that shoppers have to sift through manually. I don't think big etailers are ever going to stop returning excessive "similar" search results, because doing so probably serves some profit motive for Amazon and its hosted sellers. Not to fear, we can use this against them! A Haystack pipeline might help fix this, as follows. The high-recall part is what etailers are already doing, more or less. The Haystack retriever in this case would mainly just reformat the web page's search results, scraping them into a uniform tabular format designed so that the subsequent Haystack reader can consume it directly. Next, a good AI-based Haystack reader would rank the results and return only the best ones to the shopper. High precision is the new value that this reader adds to the shopping process. Such a reader would explicitly incorporate both the hard requirements and the soft requirements a shopper may have, that is, both exact match (token matching, like SQL does) and similarity match (like vector databases do). Sometimes a person really needs to buy a D-cell battery, for example, and definitely not a C-cell, an AA-cell, or any of the other battery sizes that Amazon's high-recall recommender returns as "similar".
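To make the hard-versus-soft requirement idea concrete, here is a minimal sketch in plain Python. It is not Haystack code: the `Listing` fields, the `rank_listings` helper, and the sample titles are all made up for illustration, and simple token overlap (Jaccard similarity) stands in for the vector-database similarity a real reader would use.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    # Hypothetical row produced by the scraping step; field names are assumptions.
    title: str
    price: float


def tokens(text):
    """Lowercase whitespace tokenization, a crude stand-in for a real tokenizer."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a placeholder for embedding similarity."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_listings(listings, must_have, must_not_have, preference, top_k=3):
    """Apply hard requirements as exact token filters, then rank by soft similarity."""
    wanted = tokens(preference)
    scored = []
    for item in listings:
        title_tokens = tokens(item.title)
        if not must_have <= title_tokens:
            continue  # hard requirement: every required token must be present
        if must_not_have & title_tokens:
            continue  # hard requirement: no excluded token may be present
        scored.append((jaccard(title_tokens, wanted), item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]


# Made-up sample data standing in for scraped search results.
SAMPLE = [
    Listing("Rechargeable D-cell NiMH battery 2 pack", 14.99),
    Listing("Energizer C-cell alkaline battery 4 pack", 7.49),
    Listing("Duracell D-cell alkaline battery 4 pack", 8.99),
    Listing("AmazonBasics AA alkaline battery 20 pack", 9.99),
]

best = rank_listings(
    SAMPLE,
    must_have={"d-cell"},
    must_not_have={"rechargeable-only"},
    preference="alkaline battery 4 pack",
)
for item in best:
    print(item.title, item.price)
```

The hard filter drops the C-cell and AA listings outright, no matter how similar they look, and the soft score only orders whatever survives the filter.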
A web browser add-in could contain the Haystack pipeline I'm describing. If you instead put this functionality into a website or web app, you paint a big red target on your back for the big retailers to try to stop you. A scraper distributed across the browsers on everyone's own computers would be much harder for big etailers to retaliate against.
First question: for the above plan to work, has anyone seen an AI model such as GPTx do web scraping well enough to serve as a Haystack retriever, given enough prompt engineering? Hand-coding a web scraper seems like drudgery in 2023 with all our new NLP models. Furthermore, the big websites tend to change their HTML output just to break hand-coded scrapers, and maybe an AI-based scraper would find workarounds and be running correctly again sooner than humans can fix the code.
Second question: do big etailer sites return only partial, not complete, search results on each server round trip? That would require a workaround that repeatedly clicks the "Next" button, which complicates the retriever.
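If results do turn out to be paginated, the extra retriever logic could be a small loop like the sketch below. Everything here is hypothetical: `collect_all_pages`, the `fetch_page` callable, and the in-memory `FAKE_SITE` are stand-ins so the example runs without touching any real website. A real `fetch_page` would load the page, extract the rows, and find the target of the "Next" button.

```python
from typing import Callable, List, Optional, Tuple

# A page is (rows found on that page, URL of the next page or None).
Page = Tuple[List[str], Optional[str]]


def collect_all_pages(
    fetch_page: Callable[[str], Page],
    start_url: str,
    max_pages: int = 10,
) -> List[str]:
    """Follow "Next" links until they run out, a loop is detected, or the cap is hit."""
    results: List[str] = []
    seen = set()
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        rows, url = fetch_page(url)
        results.extend(rows)
    return results


# Made-up in-memory "site" so the sketch can run offline.
FAKE_SITE = {
    "/search?page=1": (["D-cell 4 pack", "D-cell 8 pack"], "/search?page=2"),
    "/search?page=2": (["D-cell rechargeable"], "/search?page=3"),
    "/search?page=3": (["D-cell bulk box"], None),
}


def fake_fetch(url: str) -> Page:
    """Stand-in for a real page fetch plus "Next" button lookup."""
    return FAKE_SITE[url]


all_rows = collect_all_pages(fake_fetch, "/search?page=1")
print(all_rows)
```

The `max_pages` cap keeps the scrape small, in line with fetching only what one shopper asked for, and the `seen` set guards against a "Next" link that points back to an earlier page.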
Third question: do big etailer sites aggressively stop bots from scraping their results? I know YouTube gave me a very hard time when I wrote some code to scrape video titles and subtitles. Are Amazon and eBay also going to shut this down, or will they let my bot scrape their results without trying to stop it? Keep in mind this is not heavy DoS-style scraping; it is just one scrape of the results that a single shopper wants at a time, which is not heavy hammering at all.
Thanks for reading and discussing!
GeoffreyA