
Python SDK by Bright Data: easy-to-use, scalable methods for web search and scraping. Use it to call Bright Data's scrape and search tools, bypass bot detection and CAPTCHAs, and extract data from any website in seconds.

Installation

To install the package, run the following in your terminal:

pip install brightdata-sdk

If you are using macOS, first create and activate a virtual environment for your project.

Quick Start

Create a Bright Data account and copy your API key

Initialize the Client

from brightdata import bdclient

client = bdclient(api_token="your_api_token_here") # can also be defined as BRIGHTDATA_API_TOKEN in your .env file

Launch your first request

Add a SERP search call to your code:

results = client.search("best selling shoes")

print(client.parse_content(results))

Features

| Feature | Functions | Description |
|---|---|---|
| Scrape every website | `scrape` | Scrape any website using Bright Data's scraping and anti-bot-detection capabilities |
| Web search | `search` | Search Google and other search engines by query (supports batch searches) |
| Web crawling | `crawl` | Discover and scrape multiple pages from websites with advanced filtering and depth control |
| AI-powered extraction | `extract` | Extract specific information from websites using natural language queries and OpenAI |
| Content parsing | `parse_content` | Extract text, links, images, and structured data from API responses (JSON or HTML) |
| Browser automation | `connect_browser` | Get a WebSocket endpoint for Playwright/Selenium integration with Bright Data's scraping browser |
| Search ChatGPT | `search_chatGPT` | Prompt ChatGPT and scrape its answers; supports multiple inputs and follow-up prompts |
| Search LinkedIn | `search_linkedin.posts()`, `search_linkedin.jobs()`, `search_linkedin.profiles()` | Search LinkedIn by specific queries and receive structured data |
| Scrape LinkedIn | `scrape_linkedin.posts()`, `scrape_linkedin.jobs()`, `scrape_linkedin.profiles()`, `scrape_linkedin.companies()` | Scrape LinkedIn pages and receive structured data |
| Download functions | `download_snapshot`, `download_content` | Download content for both sync and async requests |
| Client class | `bdclient` | Handles authentication, automatic zone creation and management, and options for robust error handling |
| Parallel processing | all functions | All functions use concurrent processing for multiple URLs or queries and support multiple output formats |

Try using one of the functions:

search()

# Simple single query search
result = client.search("pizza restaurants")

# Try using multiple queries (parallel processing), with custom configuration
queries = ["pizza", "restaurants", "delivery"]
results = client.search(
    queries,
    search_engine="bing",
    country="gb",
    format="raw"
)

scrape()

# Simple single URL scrape
result = client.scrape("https://example.com")

# Multiple URLs (parallel processing) with custom options
urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
results = client.scrape(
    "urls",
    format="raw",
    country="gb",
    data_format="screenshot"
)

search_chatGPT()

result = client.search_chatGPT(
    prompt="what day is it today?"
    # prompt=["What are the top 3 programming languages in 2024?", "Best hotels in New York", "Explain quantum computing"],
    # additional_prompt=["Can you explain why?", "Are you sure?", ""]  
)

client.download_content(result) # On a timeout error, the snapshot_id is printed so you can retrieve the results later with download_snapshot()

search_linkedin.

Available functions: client.search_linkedin.posts(), client.search_linkedin.jobs(), client.search_linkedin.profiles()

# Search LinkedIn profiles by name
first_names = ["James", "Idan"]
last_names = ["Smith", "Vilenski"]

result = client.search_linkedin.profiles(first_names, last_names) # can also be run as an async request
print(result) # prints the snapshot_id, which can be downloaded using the download_snapshot() function
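
A minimal follow-up sketch for retrieving the results once the snapshot is ready (the snapshot ID below is a placeholder; use the value printed above):

# Replace with the snapshot_id printed by the call above (placeholder value)
snapshot_id = "s_linkedin_example123"

data = client.download_snapshot(snapshot_id)  # fetch the finished snapshot
print(data)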

scrape_linkedin.

Available functions

client.scrape_linkedin.posts(), client.scrape_linkedin.jobs(), client.scrape_linkedin.profiles(), client.scrape_linkedin.companies()

post_urls = [
    "https://www.linkedin.com/posts/orlenchner_scrapecon-activity-7180537307521769472-oSYN?trk=public_profile",
    "https://www.linkedin.com/pulse/getting-value-out-sunburst-guillaume-de-b%C3%A9naz%C3%A9?trk=public_profile_article_view"
]

results = client.scrape_linkedin.posts(post_urls) # can also be changed to async

print(results) # will print the snapshot_id, which can be downloaded using the download_snapshot() function

crawl()

# Single URL crawl with filters
result = client.crawl(
    url="https://example.com/",
    depth=2,
    filter="/product/",           # Only crawl URLs containing "/product/"
    exclude_filter="/ads/",       # Exclude URLs containing "/ads/"
    custom_output_fields=["markdown", "url", "page_title"]
)
print(f"Crawl initiated. Snapshot ID: {result['snapshot_id']}")

# Download crawl results
data = client.download_snapshot(result['snapshot_id'])

parse_content()

# Parse scraping results
scraped_data = client.scrape("https://example.com")
parsed = client.parse_content(
    scraped_data, 
    extract_text=True, 
    extract_links=True, 
    extract_images=True
)
print(f"Title: {parsed['title']}")
print(f"Text length: {len(parsed['text'])}")
print(f"Found {len(parsed['links'])} links")

extract()

# Basic extraction (URL in query)
result = client.extract("Extract news headlines from CNN.com")
print(result)

# Using URL parameter with structured output
schema = {
    "type": "object",
    "properties": {
        "headlines": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["headlines"]
}

result = client.extract(
    query="Extract main headlines",
    url="https://cnn.com",
    output_scheme=schema
)
print(result)  # Returns structured JSON matching the schema

connect_browser()

# For Playwright (default browser_type)
from playwright.sync_api import sync_playwright

client = bdclient(
    api_token="your_api_token",
    browser_username="username-zone-browser_zone1",
    browser_password="your_password"
)

with sync_playwright() as playwright:
    browser = playwright.chromium.connect_over_cdp(client.connect_browser())
    page = browser.new_page()
    page.goto("https://example.com")
    print(f"Title: {page.title()}")
    browser.close()

download_content (for sync requests)

data = client.scrape("https://example.com")
client.download_content(data) 

download_snapshot (for async requests)

# Save this function call to a separate file
client.download_snapshot("") # Insert your snapshot_id

Tip

Hover over search() or any other function in the package to see all of its available parameters.


Function Parameters

πŸ” Search(...)

Searches using the SERP API. Accepts the same arguments as scrape(), plus:

- `query`: Search query string or list of queries
- `search_engine`: "google", "bing", or "yandex"
- Other parameters same as scrape()
🔗 scrape(...)

Scrapes a single URL or list of URLs using the Web Unlocker.

- `url`: Single URL string or list of URLs
- `zone`: Zone identifier (auto-configured if None)
- `format`: "json" or "raw"
- `method`: HTTP method
- `country`: Two-letter country code
- `data_format`: "markdown", "screenshot", etc.
- `async_request`: Enable async processing
- `max_workers`: Max parallel workers (default: 10)
- `timeout`: Request timeout in seconds (default: 30)
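
For instance, a hedged sketch of a batch scrape that uses the concurrency and timeout parameters listed above (the values are illustrative):

urls = ["https://example1.com", "https://example2.com"]
results = client.scrape(
    urls,
    format="json",
    country="us",
    data_format="markdown",
    max_workers=5,   # cap parallel workers (default: 10)
    timeout=60       # per-request timeout in seconds (default: 30)
)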
🕷️ crawl(...)

Discover and scrape multiple pages from websites with advanced filtering.

- `url`: Single URL string or list of URLs to crawl (required)
- `ignore_sitemap`: Ignore sitemap when crawling (optional)
- `depth`: Maximum crawl depth relative to entered URL (optional)
- `filter`: Regex to include only certain URLs (e.g. "/product/")
- `exclude_filter`: Regex to exclude certain URLs (e.g. "/ads/")
- `custom_output_fields`: List of output fields to include (optional)
- `include_errors`: Include errors in response (default: True)
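
As a further sketch covering the sitemap and error options above (the parameter values are illustrative assumptions):

result = client.crawl(
    url="https://example.com/",
    ignore_sitemap=True,     # follow links instead of the sitemap
    depth=1,
    include_errors=False     # drop error records from the response
)
data = client.download_snapshot(result['snapshot_id'])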
πŸ” parse_content(...)

Extract and parse useful information from API responses.

- `data`: Response data from scrape(), search(), or crawl() methods
- `extract_text`: Extract clean text content (default: True)
- `extract_links`: Extract all links from content (default: False)
- `extract_images`: Extract image URLs from content (default: False)
🤖 extract(...)

Extract specific information from websites using AI-powered natural language processing with OpenAI.

- `query`: Natural language query describing what to extract (required)
- `url`: Single URL or list of URLs to extract from (optional - if not provided, extracts URL from query)
- `output_scheme`: JSON Schema for OpenAI Structured Outputs (optional - enables reliable JSON responses)
- `llm_key`: OpenAI API key (optional - uses OPENAI_API_KEY env variable if not provided)

# Returns: ExtractResult object (string-like with metadata attributes)
# Available attributes: .url, .query, .source_title, .token_usage, .content_length
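
A brief sketch of reading those metadata attributes (the query is illustrative):

result = client.extract("Extract the main headline from https://example.com")
print(result)               # string-like: the extracted content itself
print(result.source_title)  # metadata attributes listed above
print(result.token_usage)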
🌐 connect_browser(...)

Get WebSocket endpoint for browser automation with Bright Data's scraping browser.

# Required client parameters:
- `browser_username`: Username for browser API (format: "username-zone-{zone_name}")
- `browser_password`: Password for browser API authentication
- `browser_type`: "playwright", "puppeteer", or "selenium" (default: "playwright")

# Returns: WebSocket endpoint URL string
💾 download_content(...)

Save content to local file.

- `content`: Content to save
- `filename`: Output filename (auto-generated if None)
- `format`: File format ("json", "csv", "txt", etc.)
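
For example, a small sketch that saves scraped data to a named JSON file (the filename is illustrative):

data = client.scrape("https://example.com")
client.download_content(data, filename="example_scrape.json", format="json")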
⚙️ Configuration Constants

| Constant | Default | Description |
|---|---|---|
| `DEFAULT_MAX_WORKERS` | 10 | Max parallel tasks |
| `DEFAULT_TIMEOUT` | 30 | Request timeout (in seconds) |
| `CONNECTION_POOL_SIZE` | 20 | Max concurrent HTTP connections |
| `MAX_RETRIES` | 3 | Retry attempts on failure |
| `RETRY_BACKOFF_FACTOR` | 1.5 | Exponential backoff multiplier |

Advanced Configuration

🔧 Environment Variables

Create a .env file in your project root:

BRIGHTDATA_API_TOKEN=your_bright_data_api_token
WEB_UNLOCKER_ZONE=your_web_unlocker_zone        # Optional
SERP_ZONE=your_serp_zone                        # Optional
BROWSER_ZONE=your_browser_zone                  # Optional
BRIGHTDATA_BROWSER_USERNAME=username-zone-name  # For browser automation
BRIGHTDATA_BROWSER_PASSWORD=your_browser_password  # For browser automation
OPENAI_API_KEY=your_openai_api_key              # For extract() function
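
A minimal sketch of picking these variables up in code, assuming the SDK reads them from the environment (python-dotenv is used here only to load the .env file and is not part of the SDK):

from dotenv import load_dotenv
from brightdata import bdclient

load_dotenv()        # read the .env file from the project root
client = bdclient()  # picks up BRIGHTDATA_API_TOKEN and the optional zone variables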
🌐 Manage Zones

List all active zones

# List all active zones
zones = client.list_zones()
print(f"Found {len(zones)} zones")

Configure a custom zone name

client = bdclient(
    api_token="your_token",
    auto_create_zones=False,          # Disable automatic zone creation (default: True)
    web_unlocker_zone="custom_zone",
    serp_zone="custom_serp_zone"
)
👥 Client Management

bdclient Class - Complete parameter list

bdclient(
    api_token: str = None,                    # Your Bright Data API token (required)
    auto_create_zones: bool = True,           # Auto-create zones if they don't exist
    web_unlocker_zone: str = None,            # Custom web unlocker zone name
    serp_zone: str = None,                    # Custom SERP zone name
    browser_zone: str = None,                 # Custom browser zone name
    browser_username: str = None,             # Browser API username (format: "username-zone-{zone_name}")
    browser_password: str = None,             # Browser API password
    browser_type: str = "playwright",         # Browser automation tool: "playwright", "puppeteer", "selenium"
    log_level: str = "INFO",                  # Logging level: "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"
    structured_logging: bool = True,          # Use structured JSON logging
    verbose: bool = None                      # Enable verbose logging (overrides log_level if True)
)
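
For example, a client configured for verbose debugging with a custom Web Unlocker zone (the values are illustrative):

client = bdclient(
    api_token="your_api_token",
    web_unlocker_zone="my_unlocker_zone",
    log_level="DEBUG",
    structured_logging=False
)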
⚠️ Error Handling

bdclient Class

The SDK includes built-in input validation and retry logic.

For zone-related problems, use the list_zones() function to check your active zones, and verify in your account settings that your API key has admin permissions.
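
A small defensive sketch along these lines (the SDK's specific exception classes are not listed here, so a generic except is shown as an assumption):

try:
    result = client.scrape("https://example.com")
except Exception as err:
    print(f"Request failed: {err}")
    # If the failure looks zone-related, inspect the active zones
    print(client.list_zones())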

Support

For any issues, contact Bright Data support, or open an issue in this repository.
