A simple command-line tool for scraping HTML content from a given URL and extracting data for further processing.
Good for grabbing all image/link/magnet URLs from a page, or extracting the text of certain elements.
NOT for scraping entire page content.
To install globally via npm:

```
npm i -g @trippnology/super-simple-scraper
```
Or, to install from source:

- Clone the repository: `git clone https://github.com/trippnology/super-simple-scraper.git`
- Navigate to the project directory: `cd super-simple-scraper`
- Install dependencies: `npm install`
- Make the script executable (optional, for Unix-based systems): `chmod +x index.js`
- Link the repo as a local command (optional): `npm link`
You can now run `sss` globally, as if it had been installed by npm.
You can run the scraper using the following command:

```
sss [options]
```

Or, if you installed from source:

```
node index.js [options]
```

Options:

- `-u, --url <url>`: The URL to scrape (required).
- `-s, --selector <selector>`: CSS selector to find. Default is `a`.
- `-c, --content <type>`: Process each element as this type of content (`hash`, `html`, `image`, `json`, `link`, `object`, or `text`). Default is `link`.
- `-o, --output <format>`: Output format (`html`, `json`, `object`, or `text`). Default is `text`.
It's up to you to use sensible combinations of options. If you select all images, then try to process them as links, you're not going to get any results!

Use the `-c object` and `-o object` options together to get the full cheerio object for debugging. You can use this to make sure you are dealing with the DOM that you think you are!
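To see why a mismatched combination returns nothing, here is a minimal sketch (illustrative only, with made-up data; this is not sss's actual implementation): the `link` content type reads each element's `href`, and an `<img>` caught by your selector simply has none.

```javascript
// Illustrative sketch (not sss's actual code): processing elements as links
// means reading their href attribute; elements without one produce no output.
const elements = [
  { tag: 'a', attribs: { href: 'https://example.com/page' } },
  { tag: 'img', attribs: { src: '/logo.png' } }, // selected by mistake, processed as a link
];

const links = elements
  .map((el) => el.attribs.href) // the "link" content type reads href
  .filter(Boolean);             // the img has no href, so it drops out

console.log(links); // [ 'https://example.com/page' ]
```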
Scrape a specific URL with the default options (this will find all links and return their hrefs):

```
sss -u https://example.com
```
Find all elements with a class of `.foo` and grab their HTML contents:

```
sss -u https://example.com -s .foo -c html
```
Find all links and return their href:

```
sss -u http://localhost:8080/test.html -s a -c link
```
Find all links and return their text:

```
sss -u http://localhost:8080/test.html -s a -c text
```
Find all images and return their src:

```
sss -u http://localhost:8080/test.html -s img -c image
```
Find all magnet links and return their infohash (the selector is quoted so the shell doesn't interpret the brackets):

```
sss -u http://localhost:8080/test.html -s 'a[href^=magnet]' -c hash
```
Find all scripts containing JSON and return their contents (again, quote the selector so the shell leaves the brackets and quotes alone):

```
sss -u http://localhost:8080/test.html -s 'script[type="application/json"]' -c json
```
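The text of a matched `<script>` tag is plain JSON, so you can parse it downstream. A minimal sketch in Node (the embedded data here is made up for illustration):

```javascript
// Illustrative: the text content of a JSON <script> tag,
// as the json content type would surface it.
const scriptText = '{"title": "Example", "items": [1, 2, 3]}';

// Parse it and pick out the fields you care about.
const data = JSON.parse(scriptText);
console.log(data.title);        // Example
console.log(data.items.length); // 3
```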
Find all elements with a class of `.foo` and return the full cheerio object (useful for debugging):

```
sss -u http://localhost:8080/test.html -s .foo -c object -o object
```
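The `hash` content type pulls the BitTorrent infohash out of each magnet URI. A sketch of that extraction in Node (illustrative only; the URI below is made up, and this is not necessarily how sss implements it):

```javascript
// Illustrative: extract the infohash from a magnet URI.
// A magnet link stores the hash in its xt ("exact topic") parameter.
const magnet =
  'magnet:?xt=urn:btih:C12FE1C06BBE40AF7F49904916DA4B041108E1C1&dn=example';

// Parse the query-string portion of the URI.
const query = magnet.slice(magnet.indexOf('?') + 1);
const params = new URLSearchParams(query);

// xt looks like "urn:btih:<infohash>"; the hash is the last segment.
const infohash = params.get('xt').split(':').pop();
console.log(infohash); // C12FE1C06BBE40AF7F49904916DA4B041108E1C1
```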
- Fork it!
- Create your feature branch: `git checkout -b my-new-feature`
- Commit your changes: `git commit -am 'Add some feature'`
- Push to the branch: `git push origin my-new-feature`
- Submit a pull request :D
- v1.0.0: Initial release with basic functionality.
MIT. See the full LICENSE file.