
Commit ef3d827

עידן וילנסקי authored and committed
v1.1.1: Update documentation with comprehensive feature coverage
- Add crawl(), parse_content(), and connect_browser() examples to README
- Document all client parameters including browser automation and logging
- Update environment variables for browser credentials
- Fix browser connection example import and URL issues
1 parent aefa14a commit ef3d827

File tree

6 files changed: +164 -32 lines changed

.github/workflows/test.yml

Lines changed: 4 additions & 1 deletion

@@ -5,6 +5,8 @@ on:
     branches: [ main, develop ]
   pull_request:
     branches: [ main ]
+  schedule:
+    - cron: '0 2 * * *'
 
 jobs:
   test:
@@ -43,9 +45,10 @@ jobs:
 
   test-pypi-package:
     runs-on: ubuntu-latest
+    if: github.event_name == 'schedule'
     strategy:
       matrix:
-        python-version: ['3.8', '3.11'] # Test on fewer versions for PyPI to save CI time
+        python-version: ['3.8', '3.11']
 
     steps:
       - uses: actions/checkout@v4
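Note: with the new `schedule` trigger and the `if: github.event_name == 'schedule'` guard, the `test-pypi-package` job now runs only on the nightly cron (02:00 UTC daily) rather than on every push and pull request.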

README.md

Lines changed: 137 additions & 22 deletions

@@ -7,19 +7,6 @@ pip install brightdata-sdk
 <h3 align="center">Python SDK by Bright Data, Easy-to-use scalable methods for web search & scraping</h3>
 <p></p>
 
-## Features
-
-| Feature | Functions | Description
-|--------------------------|-----------------------------|-------------------------------------
-| **Scrape every website** | `scrape` | Scrape every website using Bright's scraping and unti bot-detection capabilities
-| **Web search** | `search` | Search google and other search engines by query (supports batch searches)
-| **Search chatGPT** | `search_chatGPT` | Prompt chatGPT and scrape its answers, support multiple inputs and follow-up prompts
-| **Search linkedin** | `search_linkedin.posts()`, `search_linkedin.jobs()`, `search_linkedin.profiles()` | Search LinkedIn by specific queries, and recieve structured data
-| **Scrape linkedin** | `scrape_linkedin.posts()`, `scrape_linkedin.jobs()`, `scrape_linkedin.profiles()`, `scrape_linkedin.companies()` | Scrape LinkedIn and recieve structured data
-| **Download functions** | `download_snapshot`, `download_content` | Download content for both sync and async requests
-| **Client class** | `bdclient` | Handles authentication, automatic zone creation and managment, and options for robust error handling
-| **Parallel processing** | **all functions** | All functions use Concurrent processing for multiple URLs or queries, and support multiple Output Formats
-
 ## Installation
 To install the package, open your terminal:
 
@@ -32,15 +19,39 @@ pip install brightdata-sdk
 
 Create a [Bright Data](https://brightdata.com/) account and copy your API key
 
-### 1. Initialize the Client
+### Initialize the Client
 
 ```python
 from brightdata import bdclient
 
 client = bdclient(api_token="your_api_token_here") # can also be defined as BRIGHTDATA_API_TOKEN in your .env file
 ```
 
-### 2. Try usig one of the functions
+### Launch your first request
+Add a SERP function call to your code:
+```python
+results = client.search("best selling shoes")
+
+print(client.parse_content(results))
+```
+
+## Features
+
+| Feature | Functions | Description
+|--------------------------|-----------------------------|-------------------------------------
+| **Scrape every website** | `scrape` | Scrape any website using Bright Data's scraping and anti-bot-detection capabilities
+| **Web search** | `search` | Search Google and other search engines by query (supports batch searches)
+| **Web crawling** | `crawl` | Discover and scrape multiple pages from websites with advanced filtering and depth control
+| **Content parsing** | `parse_content` | Extract text, links, images and structured data from API responses (JSON or HTML)
+| **Browser automation** | `connect_browser` | Get a WebSocket endpoint for Playwright/Selenium integration with Bright Data's scraping browser
+| **Search chatGPT** | `search_chatGPT` | Prompt ChatGPT and scrape its answers; supports multiple inputs and follow-up prompts
+| **Search linkedin** | `search_linkedin.posts()`, `search_linkedin.jobs()`, `search_linkedin.profiles()` | Search LinkedIn by specific queries, and receive structured data
+| **Scrape linkedin** | `scrape_linkedin.posts()`, `scrape_linkedin.jobs()`, `scrape_linkedin.profiles()`, `scrape_linkedin.companies()` | Scrape LinkedIn and receive structured data
+| **Download functions** | `download_snapshot`, `download_content` | Download content for both sync and async requests
+| **Client class** | `bdclient` | Handles authentication, automatic zone creation and management, and options for robust error handling
+| **Parallel processing** | **all functions** | All functions use concurrent processing for multiple URLs or queries, and support multiple output formats
+
+### Try using one of the functions
 
 #### `Search()`
 ```python
@@ -108,6 +119,56 @@ results = client.scrape_linkedin.posts(post_urls) # can also be changed to async
 print(results) # will print the snapshot_id, which can be downloaded using the download_snapshot() function
 ```
 
+#### `crawl()`
+```python
+# Single URL crawl with filters
+result = client.crawl(
+    url="https://example.com/",
+    depth=2,
+    filter="/product/",          # Only crawl URLs containing "/product/"
+    exclude_filter="/ads/",      # Exclude URLs containing "/ads/"
+    custom_output_fields=["markdown", "url", "page_title"]
+)
+print(f"Crawl initiated. Snapshot ID: {result['snapshot_id']}")
+
+# Download crawl results
+data = client.download_snapshot(result['snapshot_id'])
+```
+
+#### `parse_content()`
+```python
+# Parse scraping results
+scraped_data = client.scrape("https://example.com")
+parsed = client.parse_content(
+    scraped_data,
+    extract_text=True,
+    extract_links=True,
+    extract_images=True
+)
+print(f"Title: {parsed['title']}")
+print(f"Text length: {len(parsed['text'])}")
+print(f"Found {len(parsed['links'])} links")
+```
+
+#### `connect_browser()`
+```python
+# For Playwright (default browser_type)
+from playwright.sync_api import sync_playwright
+
+client = bdclient(
+    api_token="your_api_token",
+    browser_username="username-zone-browser_zone1",
+    browser_password="your_password"
+)
+
+with sync_playwright() as playwright:
+    browser = playwright.chromium.connect_over_cdp(client.connect_browser())
+    page = browser.new_page()
+    page.goto("https://example.com")
+    print(f"Title: {page.title()}")
+    browser.close()
+```
+
 **`download_content`** (for sync requests)
 ```python
 data = client.scrape("https://example.com")
@@ -154,6 +215,50 @@ Scrapes a single URL or list of URLs using the Web Unlocker.
 - `timeout`: Request timeout in seconds (default: 30)
 ```
 
+</details>
+<details>
+<summary>🕷️ <strong>crawl(...)</strong></summary>
+
+Discover and scrape multiple pages from websites with advanced filtering.
+
+```python
+- `url`: Single URL string or list of URLs to crawl (required)
+- `ignore_sitemap`: Ignore sitemap when crawling (optional)
+- `depth`: Maximum crawl depth relative to entered URL (optional)
+- `filter`: Regex to include only certain URLs (e.g. "/product/")
+- `exclude_filter`: Regex to exclude certain URLs (e.g. "/ads/")
+- `custom_output_fields`: List of output fields to include (optional)
+- `include_errors`: Include errors in response (default: True)
+```
+
+</details>
+<details>
+<summary>🔍 <strong>parse_content(...)</strong></summary>
+
+Extract and parse useful information from API responses.
+
+```python
+- `data`: Response data from scrape(), search(), or crawl() methods
+- `extract_text`: Extract clean text content (default: True)
+- `extract_links`: Extract all links from content (default: False)
+- `extract_images`: Extract image URLs from content (default: False)
+```
+
+</details>
+<details>
+<summary>🌐 <strong>connect_browser(...)</strong></summary>
+
+Get WebSocket endpoint for browser automation with Bright Data's scraping browser.
+
+```python
+# Required client parameters:
+- `browser_username`: Username for browser API (format: "username-zone-{zone_name}")
+- `browser_password`: Password for browser API authentication
+- `browser_type`: "playwright", "puppeteer", or "selenium" (default: "playwright")
+
+# Returns: WebSocket endpoint URL string
+```
+
 </details>
 <details>
 <summary>💾 <strong>Download_Content(...)</strong></summary>
@@ -191,8 +296,11 @@ Create a `.env` file in your project root:
 
 ```env
 BRIGHTDATA_API_TOKEN=your_bright_data_api_token
-WEB_UNLOCKER_ZONE=your_web_unlocker_zone # Optional
-SERP_ZONE=your_serp_zone # Optional
+WEB_UNLOCKER_ZONE=your_web_unlocker_zone           # Optional
+SERP_ZONE=your_serp_zone                           # Optional
+BROWSER_ZONE=your_browser_zone                     # Optional
+BRIGHTDATA_BROWSER_USERNAME=username-zone-name     # For browser automation
+BRIGHTDATA_BROWSER_PASSWORD=your_browser_password  # For browser automation
 ```
 
 </details>
@@ -223,14 +331,21 @@ client = bdclient(
 <details>
 <summary>👥 <strong>Client Management</strong></summary>
 
-bdclient Class
+bdclient Class - Complete parameter list
 
 ```python
 bdclient(
-    api_token: str = None,
-    auto_create_zones: bool = True,
-    web_unlocker_zone: str = None,
-    serp_zone: str = None,
+    api_token: str = None,            # Your Bright Data API token (required)
+    auto_create_zones: bool = True,   # Auto-create zones if they don't exist
+    web_unlocker_zone: str = None,    # Custom web unlocker zone name
+    serp_zone: str = None,            # Custom SERP zone name
+    browser_zone: str = None,         # Custom browser zone name
+    browser_username: str = None,     # Browser API username (format: "username-zone-{zone_name}")
+    browser_password: str = None,     # Browser API password
+    browser_type: str = "playwright", # Browser automation tool: "playwright", "puppeteer", "selenium"
+    log_level: str = "INFO",          # Logging level: "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"
+    structured_logging: bool = True,  # Use structured JSON logging
+    verbose: bool = None              # Enable verbose logging (overrides log_level if True)
 )
 ```
 
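Two supplementary sketches for the features documented above (not part of this commit). First, a Selenium counterpart to the Playwright example, assuming `browser_type="selenium"` makes `connect_browser()` return an endpoint that Selenium's `Remote` driver accepts, as the parameter table implies; the credentials are placeholders in the documented format:

```python
from selenium import webdriver

from brightdata import bdclient

client = bdclient(
    api_token="your_api_token",
    browser_username="username-zone-browser_zone1",  # placeholder, per the documented format
    browser_password="your_password",
    browser_type="selenium",
)

# connect_browser() returns the endpoint URL for Bright Data's scraping browser
driver = webdriver.Remote(
    command_executor=client.connect_browser(),
    options=webdriver.ChromeOptions(),
)
try:
    driver.get("https://example.com")
    print(f"Title: {driver.title}")
finally:
    driver.quit()
```

Second, a minimal sketch combining the documented `.env` variables with the new logging parameters; using `python-dotenv` is an assumption here, since the quick start only notes that `BRIGHTDATA_API_TOKEN` can live in a `.env` file:

```python
from dotenv import load_dotenv

from brightdata import bdclient

load_dotenv()  # loads BRIGHTDATA_API_TOKEN, zone names, and browser credentials from .env

client = bdclient(
    log_level="DEBUG",         # most verbose level from the documented list
    structured_logging=False,  # switch from the default JSON logs to plain text
)
```
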
brightdata/__init__.py

Lines changed: 1 addition & 1 deletion

@@ -64,7 +64,7 @@
 )
 from .utils import parse_content, parse_multiple, extract_structured_data
 
-__version__ = "1.1.0"
+__version__ = "1.1.1"
 __author__ = "Bright Data"
 __email__ = "[email protected]"
 
brightdata/client.py

Lines changed: 1 addition & 1 deletion

@@ -594,7 +594,7 @@ def connect_browser(self) -> str:
     api_token="your_token",
     browser_username="username-zone-browser_zone1",
     browser_password="your_password",
-    browser_type="playwright" # or omit for default
+    browser_type="playwright" # Playwright / Puppeteer (default)
 )
 endpoint_url = client.connect_browser() # Returns: wss://[email protected]:9222
 
examples/browser_connection_example.py

Lines changed: 20 additions & 6 deletions

@@ -2,7 +2,7 @@
 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 
 from brightdata import bdclient
-from playwright.sync_api import sync_playwright
+from playwright.sync_api import sync_playwright, Playwright
 
 client = bdclient(
     api_token="your-api-key",
@@ -11,9 +11,23 @@
     browser_zone="your-custom-browser-zone"
 ) # Hover over the function to see browser parameters (can also be taken from .env file)
 
-with sync_playwright() as playwright:
+def scrape(playwright: Playwright, url="https://example.com"):
     browser = playwright.chromium.connect_over_cdp(client.connect_browser()) # Connect to the browser using Bright Data's endpoint
-    page = browser.new_page()
-    page.goto("https://example.com")
-    print(f"Title: {page.title()}")
-    browser.close()
+    try:
+        print(f'Connected! Navigating to {url}...')
+        page = browser.new_page()
+        page.goto(url, timeout=2*60_000)
+        print('Navigated! Scraping page content...')
+        data = page.content()
+        print(f'Scraped! Data: {data}')
+    finally:
+        browser.close()
+
+
+def main():
+    with sync_playwright() as playwright:
+        scrape(playwright)
+
+
+if __name__ == '__main__':
+    main()
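
A note on the refactored example: Playwright timeouts are given in milliseconds, so `timeout=2*60_000` allows up to two minutes for navigation, and the `try/finally` guarantees the remote browser session is released even if navigation fails. For completeness, a minimal async variant of the same flow (not part of this commit; assumes the browser credentials come from `.env`, as the example's comment suggests):

```python
import asyncio

from playwright.async_api import async_playwright

from brightdata import bdclient

client = bdclient(api_token="your-api-key")  # browser credentials picked up from .env (assumption)


async def main(url: str = "https://example.com"):
    async with async_playwright() as playwright:
        # Same Bright Data CDP endpoint as in the sync example
        browser = await playwright.chromium.connect_over_cdp(client.connect_browser())
        try:
            page = await browser.new_page()
            await page.goto(url, timeout=2 * 60_000)  # milliseconds
            print(f"Title: {await page.title()}")
        finally:
            await browser.close()


asyncio.run(main())
```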

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "brightdata-sdk"
-version = "1.1.0"
+version = "1.1.1"
 description = "Python SDK for Bright Data Web Scraping and SERP APIs"
 authors = [
     {name = "Bright Data", email = "[email protected]"}
