Commit 114372d

feat: refactoring of the dependencies
1 parent fa2a4aa · commit 114372d

8 files changed: 71 additions, 14 deletions

.agent/system/project_architecture.md

Lines changed: 16 additions & 4 deletions
@@ -85,7 +85,14 @@ scrapegraph-sdk/
 - **aiohttp** 3.10+ - Async HTTP client
 - **pydantic** 2.10.2+ - Data validation and modeling
 - **python-dotenv** 1.0.1+ - Environment variable management
-- **beautifulsoup4** 4.12.3+ - HTML parsing (for pagination)
+
+**Optional Dependencies:**
+- **beautifulsoup4** 4.12.3+ - HTML parsing (for HTML validation when using `website_html`)
+  - Install with: `pip install scrapegraph-py[html]`
+- **langchain** 0.3.0+ - Langchain integration for AI workflows
+- **langchain-community** 0.2.11+ - Community integrations for Langchain
+- **langchain-scrapegraph** 0.1.0+ - ScrapeGraph integration for Langchain
+  - Install with: `pip install scrapegraph-py[langchain]`

 **Development Tools:**
 - **pytest** 7.4.0+ - Testing framework

@@ -879,12 +886,17 @@ npm publish

 ### Python SDK Dependencies

-**Runtime:**
+**Core Runtime:**
 - **requests**: Sync HTTP client
 - **aiohttp**: Async HTTP client
 - **pydantic**: Data validation
 - **python-dotenv**: Environment variables
-- **beautifulsoup4**: HTML parsing
+
+**Optional Runtime (install with extras):**
+- **beautifulsoup4**: HTML parsing (required when using `website_html`)
+  - Install with: `pip install scrapegraph-py[html]`
+- **langchain, langchain-community, langchain-scrapegraph**: Langchain integration
+  - Install with: `pip install scrapegraph-py[langchain]`

 **Development:**
 - **pytest & plugins**: Testing framework

@@ -918,7 +930,7 @@ Both SDKs depend on the ScrapeGraph AI API:
 | **Architecture** | Class-based (Client, AsyncClient) | Function-based |
 | **Async Support** | ✅ Separate AsyncClient | ✅ All functions async |
 | **Type Safety** | ✅ Pydantic models, mypy | ⚠️ JSDoc comments |
-| **Dependencies** | 5 runtime deps | 0 runtime deps |
+| **Dependencies** | 4 core + 2 optional extras | 0 runtime deps |
 | **Testing** | pytest with mocking | Manual tests |
 | **Documentation** | MkDocs auto-generated | README examples |
 | **Package Size** | ~50KB | ~20KB |
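The comparison row above claims "4 core + 2 optional extras". A quick, non-authoritative way to sanity-check that split against the installed package metadata is to list the requirement strings and separate those gated behind an `extra ==` marker; a minimal sketch (not part of this commit; both distribution-name spellings used in this repo are tried):

```python
# Sketch: inspect which declared requirements are core and which belong to an
# optional extra (requirement strings carry an `extra == "..."` marker).
from importlib import metadata

reqs = []
for dist_name in ("scrapegraph-py", "scrapegraph_py"):  # both spellings appear in this repo
    try:
        reqs = metadata.requires(dist_name) or []
        break
    except metadata.PackageNotFoundError:
        continue

core = [r for r in reqs if "extra ==" not in r]
optional = [r for r in reqs if "extra ==" in r]
print("core:", core)          # per the doc above: requests, aiohttp, pydantic, python-dotenv
print("optional:", optional)  # beautifulsoup4 under [html]; langchain* under [langchain]
```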

.github/workflows/python-publish.yml

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ jobs:
           python -m pip install --upgrade pip
           pip install pytest pytest-asyncio responses
           cd scrapegraph-py
-          pip install -e .
+          pip install -e ".[html]"

       - name: Run mocked tests with coverage
         run: |

.github/workflows/test.yml

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ jobs:
           python -m pip install --upgrade pip
           pip install pytest pytest-asyncio responses
           cd scrapegraph-py
-          pip install -e .
+          pip install -e ".[html]"

       - name: Run mocked tests with coverage
         run: |

CLAUDE.md

Lines changed: 4 additions & 1 deletion
@@ -44,7 +44,10 @@ scrapegraph-sdk/
 ### Python SDK
 - **Language**: Python 3.10+
 - **Package Manager**: uv (recommended) or pip
-- **Dependencies**: requests, pydantic, python-dotenv, aiohttp, beautifulsoup4
+- **Core Dependencies**: requests, pydantic, python-dotenv, aiohttp
+- **Optional Dependencies**:
+  - `html`: beautifulsoup4 (for HTML validation when using `website_html`)
+  - `langchain`: langchain, langchain-community, langchain-scrapegraph (for Langchain integrations)
 - **Testing**: pytest, pytest-asyncio, pytest-mock, aioresponses
 - **Code Quality**: black, isort, ruff, mypy, pre-commit
 - **Documentation**: mkdocs, mkdocs-material

scrapegraph-py/README.md

Lines changed: 24 additions & 0 deletions
@@ -14,10 +14,33 @@ Official [Python SDK ](https://scrapegraphai.com) for the ScrapeGraph API - Smar

 ## 📦 Installation

+### Basic Installation
+
 ```bash
 pip install scrapegraph-py
 ```

+This installs the core SDK with minimal dependencies. The SDK is fully functional with just the core dependencies.
+
+### Optional Dependencies
+
+For specific use cases, you can install optional extras:
+
+**HTML Validation** (required when using `website_html` parameter):
+```bash
+pip install scrapegraph-py[html]
+```
+
+**Langchain Integration** (for using with Langchain/Langgraph):
+```bash
+pip install scrapegraph-py[langchain]
+```
+
+**All Optional Dependencies**:
+```bash
+pip install scrapegraph-py[html,langchain]
+```
+
 ## 🚀 Features

 - 🤖 AI-powered web scraping and search

@@ -58,6 +81,7 @@ response = client.smartscraper(
 )

 # Or using HTML content
+# Note: Using website_html requires the [html] extra: pip install scrapegraph-py[html]
 html_content = """
 <html>
 <body>

scrapegraph-py/TESTING.md

Lines changed: 4 additions & 2 deletions
@@ -39,9 +39,11 @@ Install test dependencies:
 ```bash
 cd scrapegraph-py
 pip install -r requirements-test.txt
-pip install -e .
+pip install -e ".[html]"
 ```

+**Note**: Tests require the `html` extra to be installed because they test HTML validation features. The `[html]` extra includes `beautifulsoup4` which is used for HTML validation in `SmartScraperRequest`.
+
 ### Basic Test Execution

 ```bash

@@ -255,7 +257,7 @@ The `pytest.ini` file configures:

 1. **Import Errors**
    ```bash
-   pip install -e .
+   pip install -e ".[html]"
    ```

 2. **Missing Dependencies**
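The note above can also be checked from the other direction: a hypothetical test sketch (not from the test suite) that simulates a missing `[html]` extra by flipping the module-level `HAS_BS4` flag introduced in this commit. The `user_prompt` field name and the assumption that the `ImportError` propagates out of the pydantic after-validator are assumptions, not guarantees.

```python
# Hypothetical test sketch: simulate an environment without beautifulsoup4 by
# monkeypatching the HAS_BS4 flag added in scrapegraph_py/models/smartscraper.py.
import pytest

from scrapegraph_py.models import smartscraper


def test_website_html_without_bs4_raises(monkeypatch):
    monkeypatch.setattr(smartscraper, "HAS_BS4", False)
    with pytest.raises(ImportError, match="beautifulsoup4 is required"):
        smartscraper.SmartScraperRequest(
            user_prompt="Extract the page title",  # assumed field name
            website_html="<html><body><h1>Hi</h1></body></html>",
        )
```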

scrapegraph-py/pyproject.toml

Lines changed: 8 additions & 4 deletions
@@ -3,8 +3,8 @@ name = "scrapegraph_py"
 version = "1.12.2"
 description = "ScrapeGraph Python SDK for API"
 authors = [
-    { name = "Marco Vinciguerra", email = "mvincig11@gmail.com" },
-    { name = "Lorenzo Padoan", email = "lorenzo.padoan977@gmail.com" }
+    { name = "Marco Vinciguerra", email = "marco@scrapegraphai.com" },
+    { name = "Lorenzo Padoan", email = "lorenzo@scrapegraphai.com" }
 ]


@@ -41,11 +41,15 @@ dependencies = [
     "pydantic>=2.10.2",
     "python-dotenv>=1.0.1",
     "aiohttp>=3.10",
-    "requests>=2.32.3",
-    "beautifulsoup4>=4.12.3",
 ]

 [project.optional-dependencies]
+html = ["beautifulsoup4>=4.12.3"]
+langchain = [
+    "langchain>=0.3.0",
+    "langchain-community>=0.2.11",
+    "langchain-scrapegraph>=0.1.0",
+]
 docs = ["sphinx==6.0", "furo==2024.5.6"]

 [tool.uv]

scrapegraph-py/scrapegraph_py/models/smartscraper.py

Lines changed: 13 additions & 1 deletion
@@ -15,7 +15,12 @@
 from typing import Dict, Optional, Type
 from uuid import UUID

-from bs4 import BeautifulSoup
+try:
+    from bs4 import BeautifulSoup
+    HAS_BS4 = True
+except ImportError:
+    HAS_BS4 = False
+
 from pydantic import BaseModel, Field, conint, model_validator


@@ -122,11 +127,18 @@ def validate_url_and_html(self) -> "SmartScraperRequest":
         if self.website_html is not None:
             if len(self.website_html.encode("utf-8")) > 2 * 1024 * 1024:
                 raise ValueError("Website HTML content exceeds maximum size of 2MB")
+            if not HAS_BS4:
+                raise ImportError(
+                    "beautifulsoup4 is required for HTML validation. "
+                    "Install it with: pip install scrapegraph-py[html] or pip install beautifulsoup4"
+                )
             try:
                 soup = BeautifulSoup(self.website_html, "html.parser")
                 if not soup.find():
                     raise ValueError("Invalid HTML - no parseable content found")
             except Exception as e:
+                if isinstance(e, ImportError):
+                    raise
                 raise ValueError(f"Invalid HTML structure: {str(e)}")

         # Validate URL
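From a caller's point of view, the guard above turns a hard failure at import time into an actionable error only when `website_html` is actually used. A minimal usage sketch of that behaviour (the `user_prompt` field name is an assumption; exact output depends on which extras are installed):

```python
# Sketch of the caller-facing behaviour when the [html] extra is not installed.
from scrapegraph_py.models.smartscraper import SmartScraperRequest

try:
    SmartScraperRequest(
        user_prompt="Extract the main heading",  # assumed field name
        website_html="<html><body><h1>Hello</h1></body></html>",
    )
except ImportError as exc:
    # Raised only when beautifulsoup4 is missing; the message points at the fix:
    # pip install scrapegraph-py[html]
    print(exc)
```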

0 commit comments

Comments
 (0)