Home
Kryptone is a web scraping framework wrapping Selenium, designed for marketers who need efficient web crawling and data extraction.
To begin using Kryptone, initialize a new project with the following command:
```
python -m kryptone start_project project_name
```

This command generates the following project structure:
```
├── project
│   ├── media
│   │   ├── /**/*.json
│   │   ├── /**/*.jpeg
│   ├── cache.json
│   ├── kryptone.log
│   ├── manage.py
│   ├── models.py
│   ├── settings.py
│   └── spiders.py
```

Projects are self-contained, allowing for the organization of spiders and data collection methods.
To initiate a crawl, define an entry point URL from which the spider will begin gathering data. There are several methods for providing this URL:
Using Meta.start_urls without a file:

```python
from kryptone.base import SiteCrawler

class MyWebscrapper(SiteCrawler):
    class Meta:
        start_urls = ['http://example.com']
```

Using a file loader:
```python
from kryptone.base import SiteCrawler
from kryptone.utils.urls import LoadStartUrls

class MyWebscrapper(SiteCrawler):
    class Meta:
        start_urls = LoadStartUrls()
```

By default, LoadStartUrls expects a file named start_urls.csv or start_urls.json in the project directory. The file should contain a list of URLs to be crawled.
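The page describes the file as containing a list of URLs to be crawled. As an illustration only (the exact schema LoadStartUrls expects is not shown here), a start_urls.json file could be a plain JSON array, which can be generated with the standard library:

```python
import json

# Assumption: start_urls.json holds a plain JSON list of
# entry-point URLs, as described above.
start_urls = [
    "http://example.com",
    "http://example.com/products",
]

with open("start_urls.json", "w", encoding="utf-8") as f:
    json.dump(start_urls, f, indent=2)

# Reading the file back yields the same list of URLs
with open("start_urls.json", encoding="utf-8") as f:
    loaded = json.load(f)
```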
In crawl mode, the spider adds every URL it sees on the current page to the queue. This includes links to other pages, images, and any other URLs present in the HTML content.
Each URL is transformed into a kryptone.utils.URL object, which provides various methods for manipulating and filtering it. The URL object is then added to the queue for processing.
The way the spider gathers URLs can be altered in several ways:
- By applying URL filters through ignore tests
However, not every URL is valid to visit. Before a URL is added to the queue, it goes through a check process, and the following are ignored:
- URLs that are not on the same domain as the start URL
- URLs that are empty
- URLs that are a fragment (e.g., #section)
- URLs that end with "/" and are not the start URL
- Image URLs, if Meta.ignore_images is set to True
- URLs that were already visited (visited URLs are stored in the visited_urls list)
- URLs that were already seen (seen URLs are stored in the seen_urls list)
URLs that pass these tests are considered valid and are then tested by the filters registered in Meta.url_ignore_tests through the run_url_filters method.
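For intuition, the check process described above can be sketched in plain Python with the standard library. This is an illustration of the listed rules, not Kryptone's actual implementation:

```python
from urllib.parse import urlparse

# Illustration only: plain-Python versions of the validity
# checks listed above, not Kryptone's internal code.
def is_valid_url(url, start_url, visited_urls, seen_urls, ignore_images=False):
    if not url or url.startswith("#"):
        return False  # empty URLs and bare fragments are ignored
    if urlparse(url).netloc != urlparse(start_url).netloc:
        return False  # not on the same domain as the start URL
    if url.endswith("/") and url != start_url:
        return False  # trailing-slash URLs that are not the start URL
    if ignore_images and urlparse(url).path.lower().endswith((".jpg", ".jpeg", ".png")):
        return False  # image URLs when Meta.ignore_images is True
    if url in visited_urls or url in seen_urls:
        return False  # already visited or already seen
    return True
```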
The setup_class method ensures that the storage class is initialized before the spider starts. This method is called once per class, making it ideal for setting up class-level resources or configurations.
It is executed before the before_start method, allowing users to perform any necessary setup before the spider begins its crawling process.
It also creates all the necessary files, such as uuid_map.json, used to store the different states of the spiders.
The before_start method is a hook provided by the SiteCrawler class in the Kryptone framework. It is designed to execute preparatory steps just before the spider begins visiting a URL. Users can override this method in their custom spider classes to add their own preprocessing steps or to extend the functionality provided by the base class.
To ensure that any necessary setup defined in the SiteCrawler base class is executed along with any additional logic added in the subclass, it's recommended to call super().before_start() within the overridden before_start method.
Here's an example of how to override the before_start method in a custom spider class:

```python
from kryptone.base import SiteCrawler

class MyWebScraper(SiteCrawler):
    start_url = 'http://example.com'

    def before_start(self):
        super().before_start()  # Ensures base class logic is executed
```

Example Use Cases
- Logging: Initialize logging configuration before starting the crawl.
- Custom Headers: Set custom HTTP headers for requests.
- Authentication: Perform authentication or login actions before accessing restricted pages.
- Environment Setup: Configure the environment or load necessary resources before crawling begins.
These are actions that can be executed at different stages of the crawling process. They allow you to customize the behavior of your spider and perform specific actions at various points during the crawl.
The hooks are executed in this specific order: post_navigation_actions, current_page_actions then before_next_page_actions.
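This ordering can be illustrated with a minimal stand-in crawler. The stub below is not the real SiteCrawler (which drives a Selenium browser); it only records which hook fires when during a single page visit:

```python
# Plain-Python stub illustrating the documented hook order;
# method names match the hooks described in this section.
class StubCrawler:
    def __init__(self):
        self.calls = []

    def post_navigation_actions(self, **kwargs):
        self.calls.append("post_navigation_actions")

    def current_page_actions(self, current_url, **kwargs):
        self.calls.append("current_page_actions")

    def before_next_page_actions(self, **kwargs):
        self.calls.append("before_next_page_actions")

    def visit_page(self, current_url):
        # Hooks fire in the documented order on each visited page
        self.post_navigation_actions()
        self.current_page_actions(current_url)
        self.before_next_page_actions()

crawler = StubCrawler()
crawler.visit_page("http://example.com")
```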
Actions to be executed immediately after loading a page (e.g., clicking a cookie consent button)
Actions that execute on every visited page.
```python
from kryptone.base import SiteCrawler

class MyWebscrapper(SiteCrawler):
    start_url = 'http://example.com'

    def current_page_actions(self, current_url, **kwargs):
        # Do something here
        pass
```

Actions to be executed just before navigating to the next page (e.g., clicking a "Next" button), or for running checks on the next URL to be visited.
```python
from kryptone.base import BaseCrawler

class MyWebscrapper(BaseCrawler):
    start_url = 'http://example.com'

    def post_navigation_actions(self, **kwargs):
        self.click_consent_button(element_id='button')
```

You can define actions that depend on the current URL, leveraging kryptone.utils.URL. See utilities.
When the spider is executed in crawl mode (Meta.crawl is True), it writes a performance file useful for monitoring the spider. This file is saved in the media folder and contains information about the time taken to crawl each URL, the number of requests made, and other relevant metrics.
Seen URLs are all the URLs the spider has encountered on every page that has been visited.
Visited URLs are all the URLs that have actually been visited. This is a subset of the seen URLs.
The completion percentage is calculated based on the number of visited URLs compared to the total number of URLs in the start_urls list.
This value is accessible via the calculate_completion_percentage cached property.
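The underlying formula is simply the number of visited URLs over the total number of start URLs. A minimal sketch of the calculation (an illustration, not the cached property itself):

```python
# Completion percentage as described: visited URLs over the
# total number of URLs in start_urls, as a percentage.
def completion_percentage(visited_urls, start_urls):
    if not start_urls:
        return 0.0
    return len(visited_urls) / len(start_urls) * 100

completion_percentage(
    ["http://example.com/a"],
    ["http://example.com/a", "http://example.com/b"],
)  # 50.0
```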
The data collected by the spider is saved in JSON format on the DATA_CONTAINER attribute of the class.
```python
import dataclasses

from kryptone.base import SiteCrawler

@dataclasses.dataclass
class DataContainer:
    title: str = None
    url: str = None
    content: str = None

class ExampleSpider(SiteCrawler):
    model = DataContainer

    class Meta:
        start_urls = ['http://example.com']

    def current_page_actions(self, current_url, **kwargs):
        data = {
            'title': self.get_title(),
            'url': current_url,
            'content': self.get_content()
        }
        self.save_data(data)
```

You can save extracted data by calling save_data(data), where data is a dictionary containing the data you want to save.
You need to define a dataclass object in order to use the save_data method. The dataclass object can be defined in the models.py file of your project.
The dataclass should contain the fields you want to save, and the save_data method will automatically map the data to the fields.
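Conceptually, that mapping works like ordinary dataclass construction from a dictionary. The snippet below illustrates the idea with plain Python; it is not Kryptone's internal code:

```python
import dataclasses

@dataclasses.dataclass
class DataContainer:
    title: str = None
    url: str = None
    content: str = None

data = {
    'title': 'Example Domain',
    'url': 'http://example.com',
    'content': 'Some page text',
}

# Conceptually, save_data maps the dictionary keys onto the
# matching dataclass fields:
record = DataContainer(**data)
```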
The after_data_save method is a hook that is called after data has been saved.
This allows you to perform additional actions or modifications after the data has been stored.
The download_images method allows you to download images from a given URL and save them in the media folder of your project.
The method takes a list of image URLs and downloads them to the specified directory.
```python
from kryptone.base import SiteCrawler

class ExampleSpider(SiteCrawler):
    class Meta:
        start_urls = ['http://example.com']

    def current_page_actions(self, current_url, **kwargs):
        urls = ['http://example.com/image.jpg']
        self.download_images(urls)
```

While in crawl mode, the spider will add every URL it sees on the page. However, in cases where you need to add a specific set of URLs to the current list of URLs to visit, you can call add_urls.