TS-Walmart-Scraper is a web scraping tool developed in TypeScript on top of the Crawlee library. It allows users to extract relevant data points from products on walmart.com. The scraper can be used with inputs such as category URLs, brand URLs, search keywords or specific product URLs.
Clone the repository
git clone https://github.com/<username>/TS-Walmart-Scraper.git
cd TS-Walmart-Scraper
Install dependencies
npm install
The input for the scraper is a JSON file named INPUT.json, which should be located in the following directory: project_folder\storage\key_value_stores\default\. The INPUT.json file should contain the following fields:
productUrls: An array of URLs for specific product pages to scrape.
listingUrls: An array of URLs for category or brand pages to scrape (pages that contain a product listing and pagination).
keywords: An array of search keywords to use when searching walmart.com.
maxPrice: The maximum price for products to scrape.
minPrice: The minimum price for products to scrape.
startPageNumber: The page number to start scraping from.
finalPageNumber: The final page number to scrape.
Setting minPrice and maxPrice to 0 tells the scraper to collect products from all price ranges.
Setting startPageNumber and finalPageNumber to 0 tells the scraper to crawl every available page.
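As a rough illustration of the input described above, the fields can be sketched as a TypeScript type with one example value. The type name ScraperInput, the helper describeFilters, and the sample values are hypothetical and not part of the project. The helper assumes the "all ranges" behaviour applies when both values of a pair are 0, which is one reading of the rule above.

```typescript
// Hypothetical type mirroring the INPUT.json fields listed above.
interface ScraperInput {
    productUrls: string[];
    listingUrls: string[];
    keywords: string[];
    maxPrice: number;
    minPrice: number;
    startPageNumber: number;
    finalPageNumber: number;
}

// Made-up example: search by keyword, any price, pages 1 to 3.
const exampleInput: ScraperInput = {
    productUrls: [],
    listingUrls: [],
    keywords: ['laptop'],
    maxPrice: 0,
    minPrice: 0,
    startPageNumber: 1,
    finalPageNumber: 3,
};

// Hypothetical helper describing the "0 means no limit" convention.
// Assumption: the convention applies when BOTH values of a pair are 0.
function describeFilters(input: ScraperInput): { allPrices: boolean; allPages: boolean } {
    return {
        allPrices: input.minPrice === 0 && input.maxPrice === 0,
        allPages: input.startPageNumber === 0 && input.finalPageNumber === 0,
    };
}

// Print the example as JSON, in the shape INPUT.json is expected to have.
console.log(JSON.stringify(exampleInput, null, 2));
```

The printed JSON could be saved as INPUT.json in the directory mentioned above and adjusted as needed.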
To run the scraper, navigate to the project directory in your terminal and run the following command:
npm start
The output of the scraper will be a series of JSON files, one per product scraped, and will be located in the following directory: project_folder\storage\datasets\default.
The output JSON files from TS-Walmart-Scraper include the following fields:
URL: The URL of the product page.
idCodes: An object containing the unique identifier codes of the product, including the SKU and UPC.
seller: An object containing information about the seller and brand of the product, including the brand, brandURL, seller, and sellerURL.
title: The title of the product.
media: An object containing URLs for images and videos of the product, including the main image URL, gallery array of image URLs, and videos array of video objects, each with a title and url field.
pricing: An object containing pricing information for the product, including the salePrice, fullPrice, and currencySymbol.
isAvailable: A boolean indicating whether the product is currently available.
isGiftEligible: A boolean indicating whether the product is eligible for gift-giving.
isUsed: A boolean indicating whether the product is used.
rating: An object containing rating information for the product and seller, including the itemRating, itemReviews, sellerRating, and sellerReviews.
orderLimits: An object containing minimum and maximum order limits for the product, including the min and max fields.
category: An object containing information about the category of the product, including the fullPath and pathParts array of category objects, each with a name and url field.
info: An object containing additional information about the product, including the shortDescription, longDescription, and specifications array of objects, each with an attribute and value field.
variants: An array containing information about different variants of the product, including objects with an isCurrentVariant, url, SKU, isAvailable, pricing, and options field. The options field contains an array of objects, each with an attribute and value field.
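To make the output shape easier to work with, part of it can be sketched as TypeScript types. The field names follow the list above, but the value types (for example, numeric prices), the type names, the helper availableVariantUrls, and the sample record are all assumptions for illustration and are not taken from the project's code.

```typescript
// Field names follow the list above; the value types are assumptions.
interface AttributeValue {
    attribute: string;
    value: string;
}

interface Pricing {
    salePrice: number;
    fullPrice: number;
    currencySymbol: string;
}

interface Variant {
    isCurrentVariant: boolean;
    url: string;
    SKU: string;
    isAvailable: boolean;
    pricing: Pricing;
    options: AttributeValue[];
}

// Partial record: seller, media, rating, orderLimits, category and info
// are omitted here to keep the sketch short.
interface ProductRecord {
    URL: string;
    idCodes: { SKU: string; UPC: string };
    title: string;
    pricing: Pricing;
    isAvailable: boolean;
    variants: Variant[];
}

// Hypothetical helper: URLs of the variants that are marked as available.
function availableVariantUrls(product: ProductRecord): string[] {
    return product.variants.filter((v) => v.isAvailable).map((v) => v.url);
}

// Made-up sample record with placeholder values, not real scraper output.
const price: Pricing = { salePrice: 10, fullPrice: 12, currencySymbol: '$' };
const sample: ProductRecord = {
    URL: 'https://www.walmart.com/ip/placeholder-red',
    idCodes: { SKU: '000', UPC: '000000000000' },
    title: 'Placeholder product',
    pricing: price,
    isAvailable: true,
    variants: [
        {
            isCurrentVariant: true,
            url: 'https://www.walmart.com/ip/placeholder-red',
            SKU: '000',
            isAvailable: false,
            pricing: price,
            options: [{ attribute: 'Color', value: 'Red' }],
        },
        {
            isCurrentVariant: false,
            url: 'https://www.walmart.com/ip/placeholder-blue',
            SKU: '001',
            isAvailable: true,
            pricing: price,
            options: [{ attribute: 'Color', value: 'Blue' }],
        },
    ],
};

console.log(availableVariantUrls(sample));
```

Typing the records this way lets the compiler catch misspelled field names when post-processing the scraped JSON files.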
This project is licensed under the AGPL-3.0 License. See the LICENSE file for details.
I hope you find this software useful and I would be honored if you fork this repository and collaborate with me to improve it. If you have any suggestions or find any bugs, please don't hesitate to open an issue or submit a pull request. Thanks for using TS-Walmart-Scraper!