Commit 6b88199: Add files via upload (1 parent 4a23a6d)

11 files changed: +1833 −0 lines

docs/api.md

Lines changed: 133 additions & 0 deletions
# API Reference

Complete reference for the EasyScrape API.

## Core Functions

### `scrape()`

```python
es.scrape(url: str, **options) -> ScrapeResult
```

Fetch a URL and return a result object.

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | `str` | required | The URL to fetch |
| `method` | `str` | `"GET"` | HTTP method |
| `headers` | `dict` | `None` | Custom headers |
| `timeout` | `float` | `30.0` | Request timeout in seconds |
| `retries` | `int` | `3` | Number of retry attempts |
| `follow_redirects` | `bool` | `True` | Follow HTTP redirects |

**Returns:** `ScrapeResult` object

**Example:**

```python
result = es.scrape(
    "https://example.com",
    timeout=10,
    headers={"Accept-Language": "en-GB"}
)
```
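The `retries` option means a failed request is retried before an error surfaces. EasyScrape's internal retry logic isn't shown in these docs, but the general pattern can be sketched with the standard library alone (the `fetch_with_retries` helper and its backoff schedule below are illustrative, not part of the EasyScrape API):

```python
import time

def fetch_with_retries(fetch, retries=3, backoff=0.5):
    """Call fetch() up to retries + 1 times, sleeping with
    exponential backoff between failed attempts."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * (2 ** attempt))

# A flaky "request" that fails twice, then succeeds:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "200 OK"

print(fetch_with_retries(flaky, backoff=0.01))  # survives the first two failures
```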
---

### `async_scrape()`

```python
await es.async_scrape(url: str, **options) -> ScrapeResult
```

Async version of `scrape()`. Same parameters.

**Example:**

```python
import asyncio
import easyscrape as es

async def main():
    result = await es.async_scrape("https://example.com")
    print(result.title())

asyncio.run(main())
```

---

## ScrapeResult

The result object returned by `scrape()` and `async_scrape()`.

### Properties

| Property | Type | Description |
|----------|------|-------------|
| `status_code` | `int` | HTTP status code |
| `text` | `str` | Response body as text |
| `content` | `bytes` | Response body as bytes |
| `headers` | `dict` | Response headers |
| `url` | `str` | Final URL (after redirects) |

### Methods

#### `css(selector: str) -> str | None`

Extract the text of the first matching element.

```python
title = result.css("h1")
```

#### `css_all(selector: str) -> list[str]`

Extract the text of every matching element.

```python
items = result.css_all("li.item")
```

#### `json() -> dict | list`

Parse the response body as JSON.

```python
data = result.json()
```

#### `title() -> str | None`

Get the page title.
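Conceptually, `title()` just needs to pull the text of the `<title>` element out of the fetched HTML. A rough stand-in using only the standard library's `html.parser` (this `TitleParser` class is illustrative, not EasyScrape internals):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data.strip()

parser = TitleParser()
parser.feed("<html><head><title>Example Domain</title></head></html>")
print(parser.title)
```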
#### `main_text() -> str`

Extract the main content, stripped of navigation and boilerplate.

#### `safe_links() -> list[str]`

Get all links on the page, filtered to remove unsafe protocols.
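The exact filtering `safe_links()` applies is internal to EasyScrape, but the general idea (dropping schemes such as `javascript:` and `data:` that are unsafe to follow) can be sketched with `urllib.parse`; the `filter_unsafe` helper below is illustrative, not part of the library:

```python
from urllib.parse import urlparse

SAFE_SCHEMES = {"http", "https", ""}  # "" covers relative links

def filter_unsafe(links):
    """Keep only links whose URL scheme is safe to follow."""
    return [
        link for link in links
        if urlparse(link).scheme.lower() in SAFE_SCHEMES
    ]

links = [
    "https://example.com/about",
    "/relative/path",
    "javascript:alert(1)",
    "data:text/html;base64,PGI+",
]
print(filter_unsafe(links))
```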
---

## Configuration

### `Config`

```python
from easyscrape import Config

config = Config(
    timeout=60,
    retries=5,
    user_agent="MyBot/1.0"
)

result = es.scrape("https://example.com", config=config)
```

See the [Configuration Guide](configuration.md) for details.

docs/async.md

Lines changed: 105 additions & 0 deletions
# Async Scraping

Scrape multiple URLs concurrently for maximum speed.

## Basic Async

```python
import asyncio
import easyscrape as es

async def main():
    result = await es.async_scrape("https://example.com")
    print(result.title())

asyncio.run(main())
```

## Concurrent Requests

Scrape multiple URLs in parallel:

```python
import asyncio
import easyscrape as es

async def scrape_all(urls: list[str]):
    tasks = [es.async_scrape(url) for url in urls]
    results = await asyncio.gather(*tasks)
    return results

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

results = asyncio.run(scrape_all(urls))

for result in results:
    print(f"{result.url}: {result.title()}")
```

## Rate-Limited Concurrency

Control the number of simultaneous requests:

```python
import asyncio
import easyscrape as es

async def scrape_with_limit(urls: list[str], max_concurrent: int = 5):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited_scrape(url: str):
        async with semaphore:
            return await es.async_scrape(url)

    tasks = [limited_scrape(url) for url in urls]
    return await asyncio.gather(*tasks)
```

## Error Handling

Handle failures gracefully:

```python
import asyncio
import easyscrape as es

async def safe_scrape(url: str):
    try:
        return await es.async_scrape(url)
    except es.ScrapeError as e:
        print(f"Failed: {url} - {e}")
        return None

async def main():
    urls = ["https://example.com", "https://invalid.example"]
    tasks = [safe_scrape(url) for url in urls]
    results = await asyncio.gather(*tasks)

    successful = [r for r in results if r is not None]
    print(f"Succeeded: {len(successful)}/{len(urls)}")

asyncio.run(main())
```

## With Async Config

```python
import asyncio
import easyscrape as es
from easyscrape import Config

config = Config(timeout=10, retries=2)

async def main():
    result = await es.async_scrape(
        "https://example.com",
        config=config
    )
    print(result.status_code)

asyncio.run(main())
```

## Best Practices

1. **Limit concurrency** - Don't overwhelm servers. Use semaphores.
2. **Handle errors** - Network requests fail. Plan for it.
3. **Respect robots.txt** - Check before bulk scraping.
4. **Add delays** - Use `rate_limit` in the config for politeness.
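Tips 1 and 4 combine naturally: a semaphore caps how many requests are in flight, while a small sleep spaces them out. A self-contained sketch with a stand-in coroutine in place of `es.async_scrape()` (the `fake_fetch` and `polite_gather` functions are purely illustrative):

```python
import asyncio

async def fake_fetch(url: str) -> str:
    """Stand-in for es.async_scrape(); just simulates I/O."""
    await asyncio.sleep(0.01)
    return f"ok:{url}"

async def polite_gather(urls, max_concurrent=2, delay=0.01):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def one(url):
        async with semaphore:
            result = await fake_fetch(url)
            await asyncio.sleep(delay)  # politeness delay per request
            return result

    return await asyncio.gather(*(one(u) for u in urls))

results = asyncio.run(polite_gather(["a", "b", "c", "d"]))
print(results)  # gather() preserves input order
```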

docs/browser.md

Lines changed: 108 additions & 0 deletions
# Browser Mode

Handle JavaScript-rendered pages with browser automation.

## When to Use

Use browser mode when:

- Content is loaded via JavaScript
- The page requires interaction (clicks, scrolls)
- The site blocks non-browser requests
- You need screenshots

## Basic Usage

```python
import easyscrape as es

# Enable browser mode
result = es.scrape(
    "https://example.com",
    browser=True
)

# Works just like regular scraping
title = result.css("h1")
```

## Wait for Content

Wait for an element to appear before extracting:

```python
result = es.scrape(
    "https://example.com",
    browser=True,
    wait_for="div.content-loaded"
)
```

## JavaScript Execution

Run JavaScript on the page:

```python
result = es.scrape(
    "https://example.com",
    browser=True,
    js_script="window.scrollTo(0, document.body.scrollHeight)"
)
```

## Screenshots

Capture a screenshot of the page:

```python
result = es.scrape(
    "https://example.com",
    browser=True,
    screenshot="page.png"
)
```

## Browser Options

```python
result = es.scrape(
    "https://example.com",
    browser=True,
    headless=True,    # Run without a visible window (default)
    timeout=60,       # Page load timeout in seconds
    wait_for="h1",    # CSS selector to wait for
    js_script=None,   # JavaScript to execute after load
    screenshot=None,  # Path to save a screenshot
)
```

## Async Browser

```python
import asyncio
import easyscrape as es

async def main():
    result = await es.async_scrape(
        "https://example.com",
        browser=True
    )
    print(result.title())

asyncio.run(main())
```

## Performance Tips

1. **Reuse sessions** - Browser startup is slow. Batch your requests.
2. **Disable images** - Faster loads when you only need text.
3. **Use headless mode** - Always run headless in production.
4. **Set timeouts** - Prevent hangs on slow pages.

## Limitations

- Slower than plain HTTP requests (browser overhead)
- Higher memory usage
- Requires browser dependencies

For most sites, plain HTTP scraping is sufficient. Use browser mode only when you need it.

docs/changelog.md

Lines changed: 25 additions & 0 deletions
# Changelog

All notable changes to EasyScrape.

## [0.1.0] - 2024

### Added

- Initial release
- Core `scrape()` function with automatic retries
- CSS selector extraction with `css()` and `css_all()`
- Async support with `async_scrape()`
- Browser mode for JavaScript-rendered pages
- Built-in helpers: `title()`, `main_text()`, `safe_links()`
- Configuration system with `Config` class
- Rate limiting support
- Proxy support
- Full type hints (PEP 561 compliant)

### Documentation

- Quick start guide
- Complete tutorial
- API reference
- Cookbook with real-world recipes
