Commit 73d8e7b

Merge pull request #29 from brightdata/dev
Adding Datasets
2 parents 143a378 + 1754652 commit 73d8e7b

213 files changed, +15459 -161 lines changed

.github/workflows/lint.yml

Lines changed: 0 additions & 33 deletions
This file was deleted.

.github/workflows/publish.yml

Lines changed: 0 additions & 30 deletions
This file was deleted.

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 # Archived SDK versions and reference implementations
 archive/
+tests
 
 # Byte-compiled / optimized / DLL files
 __pycache__/

CHANGELOG.md

Lines changed: 62 additions & 0 deletions
@@ -1,5 +1,67 @@
 # Bright Data Python SDK Changelog
 
+## Version 2.2.1 - 100 Datasets API
+
+### ✨ New Features
+
+#### Expanded Datasets Coverage
+Added 92 new dataset integrations, bringing the total to **100 datasets**:
+
+- **Luxury Brands**: Loewe, Berluti, Moynat, Hermes, Delvaux, Prada, Montblanc, YSL, Dior, Balenciaga, Bottega Veneta, Celine, Chanel, Fendi
+- **E-commerce**: Amazon (Reviews, Sellers), Walmart, Shopee, Lazada, Zalando, Sephora, Zara, Mango, Massimo Dutti, Asos, Shein, Ikea, H&M, Lego, Mouser, Digikey
+- **Social Media**: Instagram (Profiles, Posts), TikTok, Pinterest (Posts, Profiles), YouTube (Profiles, Videos, Comments), Facebook Pages Posts
+- **Real Estate**: Zillow, Airbnb, Australia Real Estate, Otodom Poland, Zonaprop Argentina, Metrocuadrado, Infocasas Uruguay, Properati, Toctoc, Inmuebles24 Mexico, Yapo Chile
+- **Business Data**: Glassdoor (Companies, Reviews, Jobs), Indeed (Companies, Jobs), ZoomInfo, PitchBook, G2, Trustpilot, TrustRadius, Owler, Slintel, Manta, VentureRadar, Companies Enriched, Employees Enriched
+- **Other**: World Zipcodes, US Lawyers, Google Maps Reviews, Yelp, Xing Profiles, OLX Brazil, Webmotors Brasil, Chileautos, LinkedIn Jobs
+
+#### SERP Pagination Support
+Added sequential querying to retrieve more than 10 search results from Google:
+
+```python
+async with BrightDataClient() as client:
+    # Get up to 50 results with automatic pagination
+    results = await client.search.google(
+        query="python programming",
+        num_results=50  # Fetches multiple pages sequentially
+    )
+```
+
+---
+
+## Version 2.2.0 - Datasets API
+
+### ✨ New Features
+
+#### Datasets API
+Access Bright Data's pre-collected datasets with filtering and export capabilities.
+
+```python
+async with BrightDataClient() as client:
+    # Filter dataset records
+    snapshot_id = await client.datasets.amazon_products(
+        filter={"name": "rating", "operator": ">=", "value": 4.5},
+        records_limit=100
+    )
+    # Download results
+    data = await client.datasets.amazon_products.download(snapshot_id)
+```
+
+**8 Datasets:** LinkedIn Profiles, LinkedIn Companies, Amazon Products, Crunchbase Companies, IMDB Movies, NBA Players Stats, Goodreads Books, World Population
+
+**Export Utilities:**
+```python
+from brightdata.datasets import export_json, export_csv
+export_json(data, "results.json")
+export_csv(data, "results.csv")
+```
+
+### 📓 Notebooks
+- `notebooks/datasets/linkedin/linkedin.ipynb` - LinkedIn datasets (profiles & companies)
+- `notebooks/datasets/amazon/amazon.ipynb` - Amazon products dataset
+- `notebooks/datasets/crunchbase/crunchbase.ipynb` - Crunchbase companies dataset
+
+---
+
 ## Version 2.1.2 - Web Scrapers & Notebooks
 
 ### 🐛 Bug Fixes
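
Review note on the SERP pagination entry in this changelog hunk: the entry says `num_results=50` is served by fetching pages sequentially. A minimal sketch of that idea follows, assuming 10 results per page. `paginate` and `fake_fetch_page` are hypothetical stand-ins written for illustration; they are not functions from the SDK, and the SDK's actual implementation is not shown in this diff.

```python
import asyncio
from typing import Awaitable, Callable, List

PAGE_SIZE = 10  # assumption: one results page holds 10 items


async def paginate(
    fetch_page: Callable[[int], Awaitable[List[str]]],
    num_results: int,
) -> List[str]:
    """Fetch pages one after another until num_results items are collected."""
    results: List[str] = []
    start = 0
    while len(results) < num_results:
        page = await fetch_page(start)  # sequential: wait for each page in turn
        if not page:  # the source has no more results
            break
        results.extend(page)
        start += PAGE_SIZE
    return results[:num_results]  # trim any overshoot from the last page


async def fake_fetch_page(start: int) -> List[str]:
    """Hypothetical stand-in for one search request; serves 35 results in total."""
    total = 35
    return [f"result-{i}" for i in range(start, min(start + PAGE_SIZE, total))]


first_25 = asyncio.run(paginate(fake_fetch_page, 25))
all_available = asyncio.run(paginate(fake_fetch_page, 50))
print(len(first_25), len(all_available))
```

Under these assumptions a request for 25 results costs three page fetches, and a request for 50 stops early once the source runs dry, which matches the "up to 50 results" wording in the changelog comment.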

LICENSE

Lines changed: 0 additions & 1 deletion
@@ -19,4 +19,3 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
-

MANIFEST.in

Lines changed: 0 additions & 1 deletion
@@ -4,4 +4,3 @@ include CHANGELOG.md
 include pyproject.toml
 recursive-include src *.py
 recursive-include src *.typed
-

README.md

Lines changed: 50 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Bright Data Python SDK
 
-The official Python SDK for [Bright Data](https://brightdata.com) APIs. Scrape any website, get SERP results, bypass bot detection and CAPTCHAs.
+The official Python SDK for [Bright Data](https://brightdata.com) APIs. Scrape any website, get SERP results, bypass bot detection and CAPTCHAs, and access 100+ ready-made datasets.
 
 [![Python](https://img.shields.io/badge/python-3.9%2B-blue)](https://www.python.org/)
 [![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
@@ -135,6 +135,55 @@ async with BrightDataClient() as client:
 - `client.scrape.instagram` - profiles, posts, comments, reels
 - `client.scrape.facebook` - posts, comments, reels
 
+## Datasets API
+
+Access 100+ ready-made datasets from Bright Data — pre-collected, structured data from popular platforms.
+
+```python
+async with BrightDataClient() as client:
+    # Filter a dataset — returns a snapshot_id
+    snapshot_id = await client.datasets.imdb_movies(
+        filter={"name": "title", "operator": "includes", "value": "black"},
+        records_limit=5
+    )
+
+    # Download when ready (polls until snapshot is complete)
+    data = await client.datasets.imdb_movies.download(snapshot_id)
+    print(f"Got {len(data)} records")
+
+    # Quick sample: .sample() auto-discovers fields, no filter needed
+    # Works on any dataset
+    snapshot_id = await client.datasets.imdb_movies.sample(records_limit=5)
+```
+
+**Export results to file:**
+
+```python
+from brightdata.datasets import export
+
+export(data, "results.json")   # JSON
+export(data, "results.csv")    # CSV
+export(data, "results.jsonl")  # JSONL
+```
+
+**Available dataset categories:**
+- **E-commerce:** Amazon, Walmart, Shopee, Lazada, Zalando, Zara, H&M, Shein, IKEA, Sephora, and more
+- **Business intelligence:** ZoomInfo, PitchBook, Owler, Slintel, VentureRadar, Manta
+- **Jobs & HR:** Glassdoor (companies, reviews, jobs), Indeed (companies, jobs), Xing
+- **Reviews:** Google Maps, Yelp, G2, Trustpilot, TrustRadius
+- **Social media:** Pinterest (posts, profiles), Facebook Pages
+- **Real estate:** Zillow, Airbnb, and 8+ regional platforms
+- **Luxury brands:** Chanel, Dior, Prada, Balenciaga, Hermes, YSL, and more
+- **Entertainment:** IMDB, NBA, Goodreads
+
+**Discover available fields:**
+
+```python
+metadata = await client.datasets.imdb_movies.get_metadata()
+for name, field in metadata.fields.items():
+    print(f"{name}: {field.type}")
+```
+
 ## Async Usage
 
 Run multiple requests concurrently:
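
Review note on the export snippet in the README hunk above: the README calls a single `export(data, path)` with `.json`, `.csv`, and `.jsonl` paths, which suggests the output format is chosen from the file extension. A minimal sketch of that kind of dispatch follows, using only the standard library. `export_by_extension` is a hypothetical name and the sample records are made up for illustration; this is not the SDK's implementation.

```python
import csv
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List

Record = Dict[str, Any]


def export_by_extension(data: List[Record], path: str) -> None:
    """Write records to path, choosing the format from the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix == ".json":
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
    elif suffix == ".jsonl":
        # One JSON object per line.
        with open(path, "w", encoding="utf-8") as f:
            for record in data:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
    elif suffix == ".csv":
        # Header row is the union of keys, in first-seen order;
        # records missing a key get an empty cell.
        fieldnames = list(dict.fromkeys(k for r in data for k in r))
        with open(path, "w", encoding="utf-8", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)
    else:
        raise ValueError(f"unsupported export format: {suffix!r}")


# Made-up sample records, written to a temporary directory.
sample = [{"title": "Black Swan", "year": 2010}, {"title": "Men in Black"}]
out_dir = tempfile.mkdtemp()
for name in ("results.json", "results.csv", "results.jsonl"):
    export_by_extension(sample, os.path.join(out_dir, name))
```

The 2.2.0 changelog entry lists separate `export_json` and `export_csv` helpers; a dispatcher along these lines could plausibly sit on top of such per-format functions, though the diff shown here does not confirm how the two are related.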
