Commit f305434

feat: nitrowebfetch (#183)
Added a small program that fetches content from websites. It can fetch the full content of a page, or just part of it via the "selector" parameter. It also lets you specify the output format: HTML or Markdown.
1 parent 638ce66 commit f305434

10 files changed: +275 -9 lines changed

Lines changed: 11 additions & 8 deletions
@@ -1,10 +1,17 @@
-name: "[nitrodigest] Publish to PyPI"
+name: "Publish Package to PyPI"
 permissions:
   contents: read

 on:
   workflow_dispatch:
     inputs:
+      package:
+        description: "Package to publish"
+        required: true
+        type: choice
+        options:
+          - Nitrodigest
+          - Nitrowebfetch
       environment:
         description: "Where to publish (testpypi or pypi)"
         required: true
@@ -14,13 +21,9 @@ on:
           - testpypi
           - pypi

-defaults:
-  run:
-    working-directory: Projects/Nitrodigest
-
 jobs:
   build-and-publish:
-    name: Build and publish Python package
+    name: Build and publish ${{ inputs.package }}
     runs-on: ubuntu-latest

     environment: ${{ inputs.environment }}
@@ -42,11 +45,11 @@ jobs:
           pip install build

       - name: Build package
+        working-directory: Projects/${{ inputs.package }}
         run: python -m build

       - name: Publish package to ${{ inputs.environment }}
         uses: pypa/gh-action-pypi-publish@release/v1
-
         with:
           repository-url: ${{ inputs.environment == 'pypi' && 'https://upload.pypi.org/legacy/' || 'https://test.pypi.org/legacy/' }}
-          packages-dir: Projects/Nitrodigest/dist/
+          packages-dir: Projects/${{ inputs.package }}/dist/
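
As a reading aid, the values this workflow derives from its two inputs can be sketched in Python. The function name `resolve_publish_target` and the dictionary keys are hypothetical; only the two upload URLs, the `Projects/<package>` layout, and the `pypi`/`testpypi` choice come from the diff. The conditional mirrors how an expression of the form `cond && a || b` yields `a` when `cond` is true and `a` is non-empty, and `b` otherwise.

```python
def resolve_publish_target(package: str, environment: str) -> dict:
    """Mirror the values the workflow derives from its inputs (illustrative only)."""
    # Same selection as: inputs.environment == 'pypi' && '<pypi url>' || '<testpypi url>'
    repository_url = (
        "https://upload.pypi.org/legacy/"
        if environment == "pypi"
        else "https://test.pypi.org/legacy/"
    )
    return {
        "working_directory": f"Projects/{package}",
        "packages_dir": f"Projects/{package}/dist/",
        "repository_url": repository_url,
    }


if __name__ == "__main__":
    print(resolve_publish_target("Nitrowebfetch", "testpypi"))
```

Because `inputs.package` is interpolated directly into both paths, each option value has to match its directory name under `Projects/` exactly, including case.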

Projects/Nitrowebfetch/MANIFEST.in

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+include README.md

Projects/Nitrowebfetch/README.md

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
+# NitroWebfetch
+
+Extract web content, cleanly.
+
+**NitroWebfetch – the developer‑friendly web content extractor with CSS selectors.**
+
+This project is in alpha phase.
+
+## Features
+
+- Extracts content from web pages using CSS selectors
+- Converts HTML to clean Markdown format
+- Fallback selectors for maximum compatibility
+- Command-line interface with various options
+- Built on Playwright for reliable web scraping
+- Completely free (open source, MIT license)
+
+## Ideas for next steps
+
+- Add support for multiple output formats (JSON, plain text)
+- Batch processing for multiple URLs
+- Custom user-agent and headers configuration
+- Integration with NitroDigest for web page summarization
+- Support for authentication and cookies
+- Content filtering and cleaning options
+
+---
+
+## Usage
+
+### Prerequisites
+
+To run this tool, you need to have [Python](https://www.python.org/downloads/) installed on your local machine.
+
+### Installation
+
+Install NitroWebfetch via pip:
+
+```bash
+pip install nitrowebfetch-cli
+playwright install firefox
+```
+
+For development installation:
+
+```bash
+cd Projects/Nitrowebfetch
+pip install -e .
+playwright install firefox
+```
+
+### Basic Usage
+
+Run NitroWebfetch to extract content from web pages:
+
+```bash
+nitrowebfetch <url> > <output_file>
+```
+
+#### Examples
+
+Extract article content from a webpage and save it to a file:
+
+```bash
+nitrowebfetch https://example.com/article > article.md
+```
+
+Extract content using a custom CSS selector:
+
+```bash
+nitrowebfetch https://example.com --selector ".main-content" > content.md
+```
+
+Get HTML output instead of Markdown:
+
+```bash
+nitrowebfetch https://example.com --format html > content.html
+```
+
+### Command Line Arguments
+
+You can customize the extraction process using command line arguments:
+
+```bash
+nitrowebfetch \
+  --selector ".article-body" \
+  --format md \
+  https://example.com
+```
+
+Available arguments:
+
+- `url`: URL to fetch content from (required)
+- `--selector`: CSS selector to use for content extraction (default: article)
+- `--format`: Format of output content - 'md' for Markdown or 'html' for raw HTML (default: md)
+
+### Fallback Selectors
+
+If the primary selector doesn't match any elements, NitroWebfetch automatically tries these alternatives:
+
+- `article`
+- `main`
+- `.article`
+- `.content`
+- `#content`
+- `.post`
+- `.entry-content`
+
+---
+
+## Contributing
+
+Do you want to contribute to this tool? Check the Contributing page:
+
+[Getting started](../../Contributing.md)
+
+## Report an issue
+
+Found an issue? You can easily report it here:
+
+[https://github.com/Frodigo/garage/issues/new](https://github.com/Frodigo/garage/issues/new)
+
+## License
+
+This project is licensed under the MIT License - see the LICENSE file for details.
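
The fallback order described in the README's "Fallback Selectors" section can be sketched without a browser. This is a minimal illustration, not the project's code: `first_matching_selector` and the `matches` callback are hypothetical names, and the callback stands in for a real DOM query.

```python
from typing import Callable, Optional

# Fallback list copied from the README, in the order it is documented.
FALLBACK_SELECTORS = [
    "article", "main", ".article", ".content",
    "#content", ".post", ".entry-content",
]


def first_matching_selector(
    primary: str, matches: Callable[[str], bool]
) -> Optional[str]:
    """Return the primary selector if it matches, else the first matching fallback."""
    if matches(primary):
        return primary
    for candidate in FALLBACK_SELECTORS:
        # The primary selector was already tried, so skip it in the fallback pass.
        if candidate != primary and matches(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Pretend the page only contains <main> and an element with class "post".
    present = {"main", ".post"}
    print(first_matching_selector(".article-body", present.__contains__))
```

Returning `None` when nothing matches lets the caller decide how to report an empty result instead of printing from inside the lookup.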
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+[build-system]
+requires = ["setuptools", "wheel"]
+build-backend = "setuptools.build_meta"
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+playwright>=1.55.0
+html2text>=2025.4.15

Projects/Nitrowebfetch/setup.cfg

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+[metadata]
+name = nitrowebfetch-cli
+version = 0.1.0
+author = Marcin Kwiatkowski
+author_email = marcin@frodigo.com
+description = The developer‑friendly web content extractor with CSS selectors.
+long_description = file: README.md
+long_description_content_type = text/markdown
+url = https://github.com/Frodigo/garage/tree/main/Projects/Nitrowebfetch
+classifiers =
+    Programming Language :: Python :: 3
+    Operating System :: OS Independent
+[options]
+package_dir =
+    = src
+packages = find:
+python_requires = >=3.8
+install_requires =
+    playwright>=1.55.0
+    html2text>=2025.4.15
+
+[options.packages.find]
+where = src
+exclude =
+    __pycache__
+    __tests__
+
+[options.entry_points]
+console_scripts =
+    nitrowebfetch = nitrowebfetch_cli:main
+
+[project]
+license = "MIT"
+license_files = ["LICENSE"]
+
+[project.urls]
+Homepage = "https://github.com/Frodigo/garage/tree/main/Projects/Nitrowebfetch"
+Issues = "https://github.com/Frodigo/garage/issues"
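
One point in this file may be worth a second look: `[project]` and `[project.urls]` are `pyproject.toml` table names, and their quoted values follow TOML syntax rather than INI syntax. The sketch below runs the standard library's `configparser` over a shortened copy of the file, assuming `setup.cfg` is parsed as INI. How setuptools itself treats the extra sections is not shown in the diff.

```python
import configparser

# Shortened copy of the setup.cfg added in this commit.
SETUP_CFG = """
[metadata]
name = nitrowebfetch-cli
version = 0.1.0

[project]
license = "MIT"
license_files = ["LICENSE"]
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

# INI parsing keeps the TOML-style quotes and brackets as literal characters.
print(repr(parser["project"]["license"]))
print(repr(parser["project"]["license_files"]))
# The [metadata] section itself carries no license key.
print("license" in parser["metadata"])
```

If the intent is to declare the license and project URLs through `setup.cfg`, the `[metadata]` section accepts `license`, `license_files`, and `project_urls` keys. Alternatively, the two `[project]` tables could move to `pyproject.toml`.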
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+"""nitrowebfetch CLI package"""
+
+__version__ = "0.1.0"
+
+from .main import main
+
+__all__ = [
+    "__version__",
+    "main",
+]
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+import asyncio
+import html2text
+from argparse import ArgumentParser
+from playwright.async_api import async_playwright
+
+
+def main():
+    parser = ArgumentParser(
+        description="nitrowebfetch - Extract content from web pages using CSS selectors",
+        epilog="Visit docs, if you need more information: https://frodigo.com/projects/Nitrowebfetch/README.md, or report issues: https://github.com/frodigo/garage/issues if something doesn't work as expected."
+    )
+    parser.add_argument(
+        "url",
+        help="URL to fetch content from"
+    )
+    parser.add_argument(
+        "--selector",
+        default="article",
+        help="CSS selector to use for content extraction (default: article)"
+    )
+
+    parser.add_argument(
+        "--format",
+        default="md",
+        help="Format of output content (default: md)"
+    )
+
+    args = parser.parse_args()
+
+    asyncio.run(_fetch_page(args.url, args.selector, args.format))
+
+
+async def _fetch_page(url, selector='article', format='md'):
+    """
+    Fetch specific content from a webpage using CSS selectors
+
+    Args:
+        url: The URL to scrape
+        selector: CSS selector (default: 'article')
+    """
+    async with async_playwright() as p:
+        browser = await p.firefox.launch(headless=True)
+        page = await browser.new_page()
+
+        try:
+            await page.goto(url)
+
+            element = await page.query_selector(selector)
+            if element:
+                html_content = await element.inner_html()
+                _render_output(html_content, format)
+            else:
+                print(f"No elements found matching selector: '{selector}'")
+
+                # Try some common article selectors as alternatives
+                alternatives = [
+                    'article', 'main', '.article', '.content',
+                    '#content', '.post', '.entry-content'
+                ]
+
+                for alt_selector in alternatives:
+                    if alt_selector != selector:
+                        alt_element = await page.query_selector(alt_selector)
+                        if alt_element:
+                            html_content = await alt_element.inner_html()
+                            _render_output(html_content, format)
+                            break
+
+        except Exception as e:
+            print(f"Error fetching page: {e}")
+        finally:
+            await browser.close()
+
+
+def _render_output(html_content, format):
+    if format == 'md':
+        print(html2text.html2text(html_content))
+        return
+
+    print(html_content)
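
Two behaviours in this file may be worth a follow-up. First, `--format` accepts any string, and anything other than `md` falls through to raw HTML. Second, the "No elements found" and error messages are printed to standard output, so they end up in the output file when the command is redirected as the README suggests. Below is a minimal sketch of one way to tighten both, using only `argparse` and `sys` from the standard library. `build_parser` and `report` are hypothetical helper names, not part of the commit.

```python
import sys
from argparse import ArgumentParser


def build_parser() -> ArgumentParser:
    """Same three arguments as the commit, with --format limited to known values."""
    parser = ArgumentParser(description="nitrowebfetch (sketch)")
    parser.add_argument("url", help="URL to fetch content from")
    parser.add_argument("--selector", default="article",
                        help="CSS selector to use (default: article)")
    parser.add_argument("--format", default="md", choices=("md", "html"),
                        help="Output format (default: md)")
    return parser


def report(message: str) -> None:
    """Write diagnostics to stderr so redirected stdout holds only page content."""
    print(message, file=sys.stderr)


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["https://example.com", "--selector", ".main-content", "--format", "html"]
    )
    print(args.url, args.selector, args.format)
    report("diagnostics go to stderr")
```

With `choices` set, `argparse` rejects an unknown format with a usage error before any browser is launched.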
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+from nitrowebfetch_cli.main import main
+
+if __name__ == "__main__":
+    main()

Projects/Testtrack/Skyline GTR - Text Classification/Text Classification.ipynb

Lines changed: 1 addition & 1 deletion
@@ -2162,7 +2162,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.10"
+   "version": "3.10.12"
   }
  },
  "nbformat": 4,
