`w2md` Web to Markdown Scraper

w2md scrapes specific content from a list of URLs and converts it into a single Markdown file. It is useful for extracting documentation or other page content in a format that other tools, such as large language models (LLMs), can readily consume.

Features

- Scrape content from a list of URLs (provided in a text file).
- Select specific HTML elements by class, ID, or tag name.
- Exclude certain elements within the target content.
- Convert the selected content to Markdown.
- Save the final concatenated Markdown content to a file.

Requirements

Python 3.x
The following Python libraries:
- requests
- beautifulsoup4
- markdownify
- rich

You can install all required dependencies using pip:

pip install -r requirements.txt

Usage

The script takes several command-line arguments to define what content to scrape and how to process it.

Command-line Arguments

- --urls or -u: The path to the file containing the URLs, one per line. (Default: urls)
- --target or -t: The class, ID, or HTML element to scrape content from (e.g., .content, #main, div). (Required)
- --exclude or -x: A comma-separated list of classes, IDs, or HTML elements to exclude from the scraped content (e.g., .social-share,#sidebar).
- --output or -o: The name of the output Markdown file. (Default: output.md)
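
For readers who want to see how such an interface fits together, here is a minimal sketch of the four options using Python's standard argparse module. It is not taken from w2md.py: the function names build_parser and split_selectors are hypothetical, and splitting --exclude on commas is an assumption based on the example command in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the options documented above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="w2md.py",
        description="Scrape selected content from a list of URLs into Markdown.",
    )
    parser.add_argument("-u", "--urls", default="urls",
                        help="path to a file containing one URL per line")
    parser.add_argument("-t", "--target", required=True,
                        help="class, ID, or tag to scrape (e.g. .content, #main, div)")
    parser.add_argument("-x", "--exclude", default="",
                        help="comma-separated selectors to drop from the target")
    parser.add_argument("-o", "--output", default="output.md",
                        help="name of the Markdown file to write")
    return parser


def split_selectors(raw: str) -> list[str]:
    """Split a comma-separated selector string, ignoring blanks (assumed format)."""
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    args = build_parser().parse_args(["-t", ".content", "-x", ".social-share,#sidebar"])
    print(args.urls, args.target, split_selectors(args.exclude), args.output)
```

Keeping the selector splitting in its own small function makes it easy to check separately from the argument parsing.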

Example Usage

python w2md.py -u urls.txt -t ".content" -x ".social-share,#sidebar" -o documentation.md

This command scrapes the content from URLs listed in urls.txt, targeting elements with class .content, excluding elements with class .social-share and ID #sidebar, and saving the output to documentation.md.

Notes

- The script waits 3 seconds between requests to avoid overwhelming the servers being scraped.
- Make sure to respect the terms of service of any website you're scraping, as automated scraping may be prohibited.
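
As an illustration of the pause between requests, the sketch below reads a URL file and calls a fetch function for each URL, sleeping before every request except the first. The names read_urls and fetch_all are hypothetical, and the fetch function is passed in as a parameter so the loop can be tried without network access; in practice it might wrap requests.get.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def read_urls(path: str) -> list[str]:
    """Return the non-empty lines of a URL file, one URL per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Fetch each URL in order, pausing `delay` seconds between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay)
        pages.append(fetch(url))
    return pages


if __name__ == "__main__":
    # Stand-in fetch function; a short delay keeps the demo quick.
    demo = fetch_all(["https://example.com/a", "https://example.com/b"],
                     fetch=lambda url: f"<html>{url}</html>", delay=0.1)
    print(len(demo))
```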

