Merged
61 changes: 45 additions & 16 deletions CHANGELOG.md
@@ -5,20 +5,42 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.1.0] - 2024-12-29
## [0.7.0] - 2025-04-30

### Added
- Initial release of the WaterCrawl Python client
- Basic API client functionality with request handling
- Support for synchronous and asynchronous crawling
- Comprehensive crawling options and configurations
- Built-in request monitoring and result downloading
- Session management and request handling
- MIT License
- Basic documentation
- New search API functionality with methods:
- `create_search_request`: Create search requests with customizable options
- `monitor_search_request`: Monitor the progress of search operations
- `get_search_request`: Retrieve details of a search request
- `stop_search_request`: Cancel an ongoing search operation
- Comprehensive test suite for all search-related API methods
- Enhanced error handling and retry mechanisms for more reliable API interactions
- Better diagnostic capabilities for API errors
- Updated documentation with complete examples of all API methods

### Changed
- None (initial release)
- Improved test suite with better error handling, diagnostic information, and retry logic
- Enhanced documentation with comprehensive examples in README

### Deprecated
- None

### Removed
- None

### Fixed
- Fixed sitemap-related tests to handle actual API responses correctly

### Security
- None

## [0.6.1] - 2025-04-20

### Added
- Added support for specifying `page_number` and `page_size` parameters to results endpoints for improved pagination and control over result sets.

### Changed
- None

### Deprecated
- None
@@ -30,8 +30,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- None

### Security
- Basic API key authentication
- Secure request handling with HTTPS
- None

## [0.6.0] - 2025-04-20

@@ -55,13 +76,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
### Security
- None

## [0.6.1] - 2025-04-20
## [0.1.0] - 2024-12-29

### Added
- Added support for specifying `page_number` and `page_size` parameters to results endpoints for improved pagination and control over result sets.
- Initial release of the WaterCrawl Python client
- Basic API client functionality with request handling
- Support for synchronous and asynchronous crawling
- Comprehensive crawling options and configurations
- Built-in request monitoring and result downloading
- Session management and request handling
- MIT License
- Basic documentation

### Changed
- None
- None (initial release)

### Deprecated
- None
@@ -73,4 +101,5 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- None

### Security
- None
- Basic API key authentication
- Secure request handling with HTTPS
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2024 WaterCrawl-Plugin
Copyright (c) 2025 WaterCrawl-Python

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
217 changes: 215 additions & 2 deletions README.md
@@ -10,7 +10,6 @@ A Python client library for interacting with the WaterCrawl API - a powerful web
pip install watercrawl-py
```


## Quick Start

```python
@@ -33,7 +32,220 @@ crawl_request = client.create_crawl_request(
# Monitor and download results
for result in client.monitor_crawl_request(crawl_request['uuid']):
if result['type'] == 'result':
print(result['data']) // it is a result object per page
print(result['data']) # it is a result object per page
```

## API Examples

### Client Initialization

```python
from watercrawl import WaterCrawlAPIClient

# Initialize with default base URL
client = WaterCrawlAPIClient('your-api-key')

# Or specify a custom base URL
client = WaterCrawlAPIClient('your-api-key', base_url='https://custom-app.watercrawl.dev/')
```

### Crawling Operations

#### List all crawl requests

```python
# Get the first page of requests (default page size: 10)
requests = client.get_crawl_requests_list()

# Specify page number and size
requests = client.get_crawl_requests_list(page=2, page_size=20)
```

#### Get a specific crawl request

```python
request = client.get_crawl_request('request-uuid')
```

#### Create a crawl request

```python
# Simple request with just a URL
request = client.create_crawl_request(url='https://example.com')

# Advanced request with a single URL
request = client.create_crawl_request(
url='https://example.com',
spider_options={
"max_depth": 1, # maximum depth to crawl
"page_limit": 1, # maximum number of pages to crawl
"allowed_domains": [], # allowed domains to crawl
"exclude_paths": [], # exclude paths
"include_paths": [] # include paths
},
page_options={
"exclude_tags": [], # exclude tags from the page
"include_tags": [], # include tags from the page
"wait_time": 1000, # wait time in milliseconds after page load
"include_html": False, # the result will include HTML
        "only_main_content": True, # keep only the main content, automatically removing headers, footers, etc.
"include_links": False, # if True the result will include links
"timeout": 15000, # timeout in milliseconds
"accept_cookies_selector": None, # accept cookies selector e.g. "#accept-cookies"
"locale": "en-US", # locale
"extra_headers": {}, # extra headers e.g. {"Authorization": "Bearer your_token"}
"actions": [] # actions to perform {"type": "screenshot"} or {"type": "pdf"}
},
plugin_options={}
)
```

#### Stop a crawl request

```python
client.stop_crawl_request('request-uuid')
```

#### Download a crawl request result

```python
# Download the crawl request as a ZIP file
zip_data = client.download_crawl_request('request-uuid')

# Save to a file
with open('crawl_results.zip', 'wb') as f:
f.write(zip_data)
```
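Because the download comes back as raw bytes, the archive can also be inspected in memory with the standard-library `zipfile` module, without touching disk. The archive layout below (a single markdown file per page) is a hypothetical stand-in for real crawl output; only the byte-handling pattern is the point:

```python
import io
import zipfile

# Build a small in-memory ZIP standing in for the bytes that
# client.download_crawl_request('request-uuid') would return.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('example.com/index.md', '# Example page')
zip_data = buf.getvalue()

# Inspect the archive directly from the bytes
with zipfile.ZipFile(io.BytesIO(zip_data)) as archive:
    names = archive.namelist()
    content = archive.read('example.com/index.md').decode('utf-8')

print(names)
print(content)
```

The same `io.BytesIO` wrapping works on the real `zip_data` returned by the client.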

#### Monitor a crawl request

```python
# Monitor with automatic result download (default)
for event in client.monitor_crawl_request('request-uuid'):
if event['type'] == 'state':
print(f"Crawl state: {event['data']['status']}")
elif event['type'] == 'result':
print(f"Received result for: {event['data']['url']}")

# Monitoring without downloading returns each result as a URL instead of a result object
for event in client.monitor_crawl_request('request-uuid', download=False):
print(f"Event type: {event['type']}")
```

#### Get crawl request results

```python
# Get the first page of results
results = client.get_crawl_request_results('request-uuid')

# Specify page number and size
results = client.get_crawl_request_results('request-uuid', page=2, page_size=20)
```
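One way to walk every page of results is to keep requesting the next page until an empty page comes back. The response shape assumed here (a plain list per page) is hypothetical, and `StubClient` stands in for `WaterCrawlAPIClient` so the sketch is runnable; the real client may wrap results differently:

```python
# Minimal pagination sketch, assuming get_crawl_request_results
# returns a list per page and an empty list past the last page.
class StubClient:
    def __init__(self, pages):
        self._pages = pages

    def get_crawl_request_results(self, uuid, page=1, page_size=10):
        return self._pages[page - 1] if page <= len(self._pages) else []

client = StubClient([['r1', 'r2'], ['r3']])

all_results, page = [], 1
while True:
    batch = client.get_crawl_request_results('request-uuid', page=page, page_size=2)
    if not batch:  # an empty page signals the end
        break
    all_results.extend(batch)
    page += 1

print(all_results)  # ['r1', 'r2', 'r3']
```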

#### Quick URL scraping

```python
# Synchronous scraping (default)
result = client.scrape_url('https://example.com')

# With page options
result = client.scrape_url(
'https://example.com',
page_options={}
)

# Asynchronous scraping
request = client.scrape_url('https://example.com', sync=False)
# Later check for results with get_crawl_request
```
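After an asynchronous scrape, the request can be polled with `get_crawl_request` until it leaves a running state. The `'status'` values and polling loop below are illustrative assumptions, and `StubClient` stands in for `WaterCrawlAPIClient` so the sketch is runnable:

```python
import time

# Polling sketch, assuming the request dict carries 'uuid' and 'status'
# fields as in the monitoring examples above.
class StubClient:
    def __init__(self):
        self._calls = 0

    def scrape_url(self, url, sync=False):
        return {'uuid': 'demo-uuid', 'status': 'running'}

    def get_crawl_request(self, uuid):
        self._calls += 1
        status = 'finished' if self._calls >= 2 else 'running'
        return {'uuid': uuid, 'status': status}

client = StubClient()
request = client.scrape_url('https://example.com', sync=False)

while True:
    request = client.get_crawl_request(request['uuid'])
    if request['status'] != 'running':
        break
    time.sleep(0)  # back off between polls (use a real delay in practice)

print(request['status'])
```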

### Sitemap Operations

#### Download a sitemap

```python
# Download using a crawl request object
crawl_request = client.get_crawl_request('request-uuid')
sitemap = client.download_sitemap(crawl_request)

# Or pass the crawl request UUID directly instead of the object
sitemap = client.download_sitemap('request-uuid')

# Process sitemap entries
for entry in sitemap:
print(f"URL: {entry['url']}, Title: {entry['title']}")
```
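Sitemap entries can be post-processed with standard-library tools; for example, grouping page titles by host with `urllib.parse`. The sample entries below are a hypothetical stand-in for the output of `download_sitemap`, assuming each entry carries `url` and `title` keys as shown above:

```python
from urllib.parse import urlparse

# Hypothetical sample standing in for client.download_sitemap(...) output
sitemap = [
    {'url': 'https://example.com/', 'title': 'Home'},
    {'url': 'https://example.com/docs', 'title': 'Docs'},
    {'url': 'https://blog.example.com/post', 'title': 'Post'},
]

# Group titles by hostname
by_host = {}
for entry in sitemap:
    host = urlparse(entry['url']).netloc
    by_host.setdefault(host, []).append(entry['title'])

print(by_host)
```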

#### Download sitemap as graph data

```python
# Pass either a crawl request UUID or a crawl request object
graph_data = client.download_sitemap_graph('request-uuid')
```

#### Download sitemap as markdown

```python
# Pass either a crawl request UUID or a crawl request object
markdown = client.download_sitemap_markdown('request-uuid')
```

### Search Operations

#### Create a search request

```python
# Simple search
search = client.create_search_request(query="python programming")

# Search with options and limited results
search = client.create_search_request(
query="python tutorial",
search_options={
        "language": None, # language code e.g. "en" or "fr" or "es"
        "country": None, # country code e.g. "us" or "fr" or "es"
        "time_range": "any", # time range e.g. "any", "hour", "day", "week", "month" or "year"
"search_type": "web", # search type e.g. "web" now just web is supported
"depth": "basic" # depth e.g. "basic" or "advanced" or "ultimate"
},
result_limit=5, # limit the number of results
sync=True, # wait for results
download=True # download results
)

# Asynchronous search
search = client.create_search_request(
query="machine learning",
search_options={},
result_limit=5, # limit the number of results
sync=False, # Don't wait for results
download=False # Don't download results
)
```

#### Monitor a search request

```python
# Monitor with automatic result download (currently the only event type is 'state')
for event in client.monitor_search_request('search-uuid'):
if event['type'] == 'state':
print(f"Search state: {event['status']}")

# Monitor without downloading results
for event in client.monitor_search_request('search-uuid', download=False):
print(f"Event: {event}")
```

#### Get search request details

```python
search = client.get_search_request('search-uuid', download=True)
```

#### Stop a search request

```python
client.stop_search_request('search-uuid')
```

## Features
@@ -43,6 +255,7 @@ for result in client.monitor_crawl_request(crawl_request['uuid']):
- Comprehensive crawling options and configurations
- Built-in request monitoring and result downloading
- Efficient session management and request handling
- Support for sitemaps and search operations

## Documentation
