This script is designed to fetch the content of one or more URLs, with special handling for Reddit, and convert it into Markdown format. It is useful for extracting content for LLM processing.
The script uses Playwright to launch a headless Chromium browser instance, and opens the target URLs concurrently (controlled by p-limit).
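To illustrate what capping concurrency means here, the following is a minimal sketch rather than the script's actual code. `createLimiter` is a hand-rolled stand-in for p-limit (so the example runs without dependencies), and `fetchAll` and `fetchOne` are hypothetical names for the per-URL browser work.

```javascript
// Hypothetical stand-in for p-limit: returns a wrapper that runs at most
// `max` async tasks at once and queues the rest.
function createLimiter(max) {
  let active = 0;
  const queue = [];

  const next = () => {
    if (active >= max || queue.length === 0) return;
    active += 1;
    const { task, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next(); // a slot freed up, start the next queued task
      });
  };

  return (task) =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
}

// Hypothetical driver: `fetchOne` stands in for opening one page and
// converting it; results come back in the same order as `urls`.
async function fetchAll(urls, fetchOne, parallel = 5) {
  const limit = createLimiter(parallel);
  return Promise.all(urls.map((url) => limit(() => fetchOne(url))));
}
```

With `parallel` set to 5, a sixth URL waits in the queue until one of the first five pages finishes.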
It scrolls each page to trigger lazy-loaded content and cleans the DOM (removing scripts, styles, and similar non-content elements) before conversion. It then uses html-to-md to convert the cleaned HTML body into Markdown.
If the URL is from Reddit, it performs specialized actions to expand "View more comments" buttons and nested replies to capture the full discussion content.
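One way the Reddit check could be done is by inspecting the URL's hostname. The helper below is a sketch under that assumption; `isRedditUrl` is a hypothetical name, and the script may detect Reddit differently.

```javascript
// Hypothetical helper: true when the URL's host is reddit.com or one of
// its subdomains (www, old, new, ...).
function isRedditUrl(rawUrl) {
  let host;
  try {
    host = new URL(rawUrl).hostname;
  } catch {
    return false; // not a parseable absolute URL
  }
  return host === 'reddit.com' || host.endsWith('.reddit.com');
}
```

Matching on the parsed hostname rather than on the raw string avoids false positives such as a path that merely contains "reddit.com".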
To run this script, install the required Node.js dependencies with npm: playwright, commander, clipboardy, html-to-md, and p-limit.
You can pass one or more URLs directly as arguments:
./fetch_markdown.mjs https://example.com https://google.com
If no URLs are provided, the script runs in interactive mode. You can enter URLs (one per line) and press Ctrl+D when finished.
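Input entered one URL per line can be turned into a clean list with a small parsing step. The sketch below assumes the input arrives as a single string; `extractUrls` is a hypothetical helper, and the filtering and de-duplication rules are illustrative choices, not confirmed behavior of the script.

```javascript
// Hypothetical helper: split raw input on any whitespace and keep only
// well-formed http(s) URLs, dropping duplicates while preserving order.
function extractUrls(text) {
  const seen = new Set();
  const urls = [];
  for (const token of text.split(/\s+/)) {
    if (!token) continue;
    let parsed;
    try {
      parsed = new URL(token);
    } catch {
      continue; // skip anything that is not an absolute URL
    }
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') continue;
    if (!seen.has(parsed.href)) {
      seen.add(parsed.href);
      urls.push(parsed.href);
    }
  }
  return urls;
}

// Mixed input: two valid URLs, one junk line, one duplicate.
const sample = 'https://example.com\nhttps://google.com\nnot-a-url\nhttps://example.com';
console.log(extractUrls(sample));
// → [ 'https://example.com/', 'https://google.com/' ]
```

Because the split is on any whitespace, the same helper also handles URLs separated by spaces.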
The script accepts the following options:
- -o, --output <file>: Specify output file (default: ~/Downloads/markdown_TIMESTAMP.md)
- -c, --clipboard: Copy results to clipboard automatically
- -p, --parallel <number>: Max parallel pages (default: 5)
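The defaults above could be applied along these lines. This is a sketch only: `resolveOptions` is a hypothetical helper, and the timestamp format is an assumption, since the actual form of TIMESTAMP is not specified here.

```javascript
// Hypothetical helper: fill in defaults for the three documented options.
// `env.homeDir` and `env.now` are injectable so the result is predictable.
function resolveOptions(raw = {}, env = {}) {
  const homeDir = env.homeDir ?? process.env.HOME ?? '.';
  const now = env.now ?? new Date();
  // Assumed timestamp format: ISO date-time (UTC) with ':' replaced by '-'.
  const stamp = now.toISOString().slice(0, 19).replace(/:/g, '-');
  const parallel = Number.parseInt(raw.parallel, 10);
  return {
    output: raw.output ?? `${homeDir}/Downloads/markdown_${stamp}.md`,
    clipboard: Boolean(raw.clipboard),
    // Fall back to the documented default of 5 for missing or invalid values.
    parallel: Number.isInteger(parallel) && parallel > 0 ? parallel : 5,
  };
}
```

Command-line values arrive as strings, so `--parallel 3` is parsed to the number 3, while a missing, zero, or non-numeric value falls back to 5.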
You can run this script directly from Raycast to quickly fetch markdown from URLs in your clipboard or by entering them manually.
- Prerequisites: Ensure you have Node.js installed.
- Script Location: Place fetch_markdown.mjs in your script commands directory.
- Permissions: Ensure the script is executable:
chmod +x fetch_markdown.mjs
- Raycast Configuration:
  - The script includes Raycast metadata headers.
  - Argument: URLs (space separated), optional. If no argument is provided, the script attempts to read URLs from your clipboard.
  - Output: The script is set to @raycast.mode fullOutput, so standard output is displayed in a Raycast window.
- From Clipboard: Copy one or more URLs, open Raycast, and run "Fetch Markdown".
- Manual Input: Open Raycast, run "Fetch Markdown", and type the URLs separated by spaces.