feat: nitrowebfetch (#183)

RetroModernDev · web-flow · commit f30543495442 · 2025-09-15T07:55:24.000+02:00
Adds a small program that fetches content from websites. It can fetch a
page's full content, or just part of it via the "selector" parameter.
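
A minimal sketch of the lookup order behind the "selector" parameter, assuming the fallback list from main.py in the diff below. The names `candidate_selectors` and `first_match` are hypothetical and do not exist in this commit, and `matches` is only a stand-in for Playwright's `page.query_selector`.

```python
# Fallback order copied from main.py in this commit.
FALLBACK_SELECTORS = [
    "article", "main", ".article", ".content",
    "#content", ".post", ".entry-content",
]


def candidate_selectors(primary="article"):
    """Return the selectors in the order they would be tried (hypothetical helper)."""
    # The primary selector goes first; fallbacks follow, skipping the
    # primary so it is not queried twice.
    return [primary] + [s for s in FALLBACK_SELECTORS if s != primary]


def first_match(primary, matches):
    """Return the first selector for which matches(selector) is truthy, else None.

    `matches` is an assumed callable standing in for page.query_selector.
    """
    for selector in candidate_selectors(primary):
        if matches(selector):
            return selector
    return None
```

For example, `first_match(".main-content", lambda s: s == "main")` would fall through `.main-content` and `article` before settling on `main`.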

The output format can also be specified: HTML or Markdown.
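
In main.py as committed below, `_render_output` prints raw HTML for any `--format` value other than `md`, so a typo such as `--format mdx` silently yields HTML. If rejecting unknown values is wanted, argparse's `choices` parameter is one possible approach. This is a sketch, not the commit's code: `build_parser` is a hypothetical helper name, while the argument names and defaults are taken from the diff.

```python
from argparse import ArgumentParser


def build_parser():
    """Build a parser like the one in main.py, with --format restricted (sketch)."""
    parser = ArgumentParser(
        description="nitrowebfetch - Extract content from web pages using CSS selectors"
    )
    parser.add_argument("url", help="URL to fetch content from")
    parser.add_argument(
        "--selector",
        default="article",
        help="CSS selector to use for content extraction (default: article)",
    )
    # Assumption: only 'md' and 'html' are valid, as the README describes.
    # argparse exits with a usage error for any other value.
    parser.add_argument(
        "--format",
        choices=["md", "html"],
        default="md",
        help="Format of output content (default: md)",
    )
    return parser
```

With this parser, `build_parser().parse_args(["https://example.com", "--format", "html"])` yields `format == "html"`, and an unsupported value stops the program before any browser is launched.
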
diff --git a/.github/workflows/publish_package.yml b/.github/workflows/publish_package.yml
@@ -1,10 +1,17 @@
-name: "[nitrodigest] Publish to PyPI"
+name: "Publish Package to PyPI"
 permissions:
   contents: read
 
 on:
   workflow_dispatch:
     inputs:
+      package:
+        description: "Package to publish"
+        required: true
+        type: choice
+        options:
+          - Nitrodigest
+          - Nitrowebfetch
       environment:
         description: "Where to publish (testpypi or pypi)"
         required: true
@@ -14,13 +21,9 @@ on:
           - testpypi
           - pypi
 
-defaults:
-  run:
-    working-directory: Projects/Nitrodigest
-
 jobs:
   build-and-publish:
-    name: Build and publish Python package
+    name: Build and publish ${{ inputs.package }}
     runs-on: ubuntu-latest
 
     environment: ${{ inputs.environment }}
@@ -42,11 +45,11 @@ jobs:
           pip install build
 
       - name: Build package
+        working-directory: Projects/${{ inputs.package }}
         run: python -m build
 
       - name: Publish package to ${{ inputs.environment }}
         uses: pypa/gh-action-pypi-publish@release/v1
-
         with:
           repository-url: ${{ inputs.environment == 'pypi' && 'https://upload.pypi.org/legacy/' || 'https://test.pypi.org/legacy/' }}
-          packages-dir: Projects/Nitrodigest/dist/
+          packages-dir: Projects/${{ inputs.package }}/dist/
diff --git a/Projects/Nitrowebfetch/MANIFEST.in b/Projects/Nitrowebfetch/MANIFEST.in
@@ -0,0 +1 @@
+include README.md
diff --git a/Projects/Nitrowebfetch/README.md b/Projects/Nitrowebfetch/README.md
@@ -0,0 +1,125 @@
+# NitroWebfetch
+
+Extract web content, cleanly.
+
+**NitroWebfetch – the developer‑friendly web content extractor with CSS selectors.**
+
+This project is in alpha phase.
+
+## Features
+
+- Extracts content from web pages using CSS selectors
+- Converts HTML to clean Markdown format
+- Fallback selectors for maximum compatibility
+- Command-line interface with various options
+- Built on Playwright for reliable web scraping
+- Completely free (open source, MIT license)
+
+## Ideas for next steps
+
+- Add support for multiple output formats (JSON, plain text)
+- Batch processing for multiple URLs
+- Custom user-agent and headers configuration
+- Integration with NitroDigest for web page summarization
+- Support for authentication and cookies
+- Content filtering and cleaning options
+
+---
+
+## Usage
+
+### Prerequisites
+
+To run this tool, you need to have [Python](https://www.python.org/downloads/) installed on your local machine.
+
+### Installation
+
+Install NitroWebfetch via pip:
+
+```bash
+pip install nitrowebfetch-cli
+playwright install firefox
+```
+
+For development installation:
+
+```bash
+cd Projects/Nitrowebfetch
+pip install -e .
+playwright install firefox
+```
+
+### Basic Usage
+
+Run NitroWebfetch to extract content from web pages:
+
+```bash
+nitrowebfetch <url> > <output_file>
+```
+
+#### Examples
+
+Extract article content from a webpage and save it to a file:
+
+```bash
+nitrowebfetch https://example.com/article > article.md
+```
+
+Extract content using a custom CSS selector:
+
+```bash
+nitrowebfetch https://example.com --selector ".main-content" > content.md
+```
+
+Get HTML output instead of Markdown:
+
+```bash
+nitrowebfetch https://example.com --format html > content.html
+```
+
+### Command Line Arguments
+
+You can customize the extraction process using command line arguments:
+
+```bash
+nitrowebfetch \
+    --selector ".article-body" \
+    --format md \
+    https://example.com
+```
+
+Available arguments:
+
+- `url`: URL to fetch content from (required)
+- `--selector`: CSS selector to use for content extraction (default: article)
+- `--format`: Format of output content - 'md' for Markdown or 'html' for raw HTML (default: md)
+
+### Fallback Selectors
+
+If the primary selector doesn't match any elements, NitroWebfetch automatically tries these alternatives:
+
+- `article`
+- `main`
+- `.article`
+- `.content`
+- `#content`
+- `.post`
+- `.entry-content`
+
+---
+
+## Contributing
+
+Do you want to contribute to this tool? Check the Contributing page:
+
+[Getting started](../../Contributing.md)
+
+## Report an issue
+
+Found an issue? You can easily report it here:
+
+[https://github.com/Frodigo/garage/issues/new](https://github.com/Frodigo/garage/issues/new)
+
+## License
+
+This project is licensed under the MIT License - see the LICENSE file for details.
diff --git a/Projects/Nitrowebfetch/pyproject.toml b/Projects/Nitrowebfetch/pyproject.toml
@@ -0,0 +1,3 @@
+[build-system]
+requires = ["setuptools", "wheel"]
+build-backend = "setuptools.build_meta"
diff --git a/Projects/Nitrowebfetch/requirements.txt b/Projects/Nitrowebfetch/requirements.txt
@@ -0,0 +1,2 @@
+playwright>=1.55.0
+html2text>=2025.4.15
diff --git a/Projects/Nitrowebfetch/setup.cfg b/Projects/Nitrowebfetch/setup.cfg
@@ -0,0 +1,38 @@
+[metadata]
+name = nitrowebfetch-cli
+version = 0.1.0
+author = Marcin Kwiatkowski
+author_email = marcin@frodigo.com
+description = The developer‑friendly web content extractor with CSS selectors.
+long_description = file: README.md
+long_description_content_type = text/markdown
+url = https://github.com/Frodigo/garage/tree/main/Projects/Nitrowebfetch
+classifiers =
+    Programming Language :: Python :: 3
+    Operating System :: OS Independent
+[options]
+package_dir =
+    = src
+packages = find:
+python_requires = >=3.8
+install_requires =
+    playwright>=1.55.0
+    html2text>=2025.4.15
+
+[options.packages.find]
+where = src
+exclude =
+    __pycache__
+    __tests__
+
+[options.entry_points]
+console_scripts =
+    nitrowebfetch = nitrowebfetch_cli:main
+
+[project]
+license = "MIT"
+license_files = ["LICENSE"]
+
+[project.urls]
+Homepage = "https://github.com/Frodigo/garage/tree/main/Projects/Nitrowebfetch"
+Issues = "https://github.com/Frodigo/garage/issues"
diff --git a/Projects/Nitrowebfetch/src/nitrowebfetch_cli/__init__.py b/Projects/Nitrowebfetch/src/nitrowebfetch_cli/__init__.py
@@ -0,0 +1,10 @@
+"""nitrowebfetch CLI package"""
+
+__version__ = "0.1.0"
+
+from .main import main
+
+__all__ = [
+    "__version__",
+    "main",
+]
diff --git a/Projects/Nitrowebfetch/src/nitrowebfetch_cli/main.py b/Projects/Nitrowebfetch/src/nitrowebfetch_cli/main.py
@@ -0,0 +1,80 @@
+import asyncio
+import html2text
+from argparse import ArgumentParser
+from playwright.async_api import async_playwright
+
+
+def main():
+    parser = ArgumentParser(
+        description="nitrowebfetch - Extract content from web pages using CSS selectors",
+        epilog="Visit docs, if you need more information: https://frodigo.com/projects/Nitrowebfetch/README.md, or report issues: https://github.com/frodigo/garage/issues if something doesn't work as expected."
+    )
+    parser.add_argument(
+        "url",
+        help="URL to fetch content from"
+    )
+    parser.add_argument(
+        "--selector",
+        default="article",
+        help="CSS selector to use for content extraction (default: article)"
+    )
+
+    parser.add_argument(
+        "--format",
+        default="md",
+        help="Format of output content (default: md)"
+    )
+
+    args = parser.parse_args()
+
+    asyncio.run(_fetch_page(args.url, args.selector, args.format))
+
+
+async def _fetch_page(url, selector='article', format='md'):
+    """
+    Fetch specific content from a webpage using CSS selectors
+
+    Args:
+        url: The URL to scrape
+        selector: CSS selector (default: 'article')
+    """
+    async with async_playwright() as p:
+        browser = await p.firefox.launch(headless=True)
+        page = await browser.new_page()
+
+        try:
+            await page.goto(url)
+
+            element = await page.query_selector(selector)
+            if element:
+                html_content = await element.inner_html()
+                _render_output(html_content, format)
+            else:
+                print(f"No elements found matching selector: '{selector}'")
+
+                # Try some common article selectors as alternatives
+                alternatives = [
+                    'article', 'main', '.article', '.content',
+                    '#content', '.post', '.entry-content'
+                ]
+
+                for alt_selector in alternatives:
+                    if alt_selector != selector:
+                        alt_element = await page.query_selector(alt_selector)
+                        if alt_element:
+                            html_content = await alt_element.inner_html()
+                            _render_output(html_content, format)
+                            break
+
+        except Exception as e:
+            print(f"Error fetching page: {e}")
+        finally:
+            await browser.close()
+
+
+def _render_output(html_content, format):
+    if format == 'md':
+        print(html2text.html2text(html_content))
+        return
+
+    print(html_content)
diff --git a/Projects/Nitrowebfetch/src/run-nitrowebfetch-cli.py b/Projects/Nitrowebfetch/src/run-nitrowebfetch-cli.py
@@ -0,0 +1,4 @@
+from nitrowebfetch_cli.main import main
+
+if __name__ == "__main__":
+    main()
diff --git a/Projects/Testtrack/Skyline GTR - Text Classification/Text Classification.ipynb b/Projects/Testtrack/Skyline GTR - Text Classification/Text Classification.ipynb
@@ -2162,7 +2162,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.10"
+   "version": "3.10.12"
   }
  },
  "nbformat": 4,
