John PENDENQUE edited this page Apr 14, 2025 · 23 revisions

Kryptone is a web scraping framework wrapping Selenium, designed for marketers who need efficient web crawling and data extraction.

Getting started

To begin using Kryptone, initialize a new project with the following command:

python -m kryptone start_project project_name

This command generates the following project structure:

├── project
│   ├── media
│   │   ├── /**/*.json
│   │   ├── /**/*.jpeg
│   ├── cache.json
│   ├── kryptone.log
│   ├── manage.py
│   ├── models.py
│   ├── settings.py
│   └── spiders.py

Projects are self-contained, allowing for the organization of spiders and data collection methods.

Crawling a website

To initiate a crawl, define an entry point URL from which the spider will begin gathering data. There are several methods for providing this URL:

Using Meta.start_urls without a file:

from kryptone.base import SiteCrawler

class MyWebscrapper(SiteCrawler):
    class Meta:
        start_urls = ['http://example.com']

Using a file loader:

from kryptone.base import SiteCrawler
from kryptone.utils.urls import LoadStartUrls

class MyWebscrapper(SiteCrawler):
    class Meta:
        start_urls = LoadStartUrls()

By default, LoadStartUrls expects a file named start_urls.csv or start_urls.json in the project directory. The file should contain a list of URLs to be crawled.
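For instance, a start_urls.json could simply hold an array of URLs (the exact schema LoadStartUrls expects is an assumption here; adjust it to your version of the framework):

```json
[
    "http://example.com",
    "http://example.com/catalogue"
]
```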

How URLs are added to the queue

In crawl mode, the spider adds every URL it sees on the current page to the queue. This includes links to other pages, images, and any other URLs present in the HTML content.

Each URL is transformed into a kryptone.utils.URL object, which provides various methods for manipulating and filtering URLs. The URL object is then added to the queue for processing.

The way the spider gathers URLs can be customized in several ways. However, not every URL it finds is valid to visit: before a URL is added to the queue, it goes through a series of checks and is ignored if any of the following apply:

  • URLs that are not on the same domain as the start URL
  • URLs that are empty
  • URLs that are fragments (e.g., #section)
  • URLs that end with "/" and are not the start URL
  • Image URLs, if Meta.ignore_images is set to True
  • URLs that were already visited (visited URLs are stored in the visited_urls list)
  • URLs that were already seen (seen URLs are stored in the seen_urls list)

URLs that pass these checks are considered valid and are then run through the filters registered in Meta.url_ignore_tests via the run_url_filters method.
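The checks above can be pictured with a plain-Python sketch (a simplified illustration using only the standard library; the real spider works on kryptone.utils.URL objects and run_url_filters, so its exact behaviour may differ):

```python
from urllib.parse import urlparse

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.webp')


def is_valid_url(url, start_url, seen_urls, visited_urls, ignore_images=False):
    """Return True when a URL passes the queueing checks described above."""
    if not url:
        return False  # empty URL
    if url.startswith('#'):
        return False  # bare fragment, e.g. #section
    if urlparse(url).netloc != urlparse(start_url).netloc:
        return False  # not on the same domain as the start URL
    if url.endswith('/') and url != start_url:
        return False  # ends with "/" and is not the start URL
    if ignore_images and url.lower().endswith(IMAGE_EXTENSIONS):
        return False  # image URL with Meta.ignore_images enabled
    if url in visited_urls or url in seen_urls:
        return False  # already visited or already seen
    return True
```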

Pre-Crawl Setup

setup_class hook

The setup_class method ensures that the storage class is initialized before the spider starts. This method is called once per class, making it ideal for setting up class-level resources or configurations. It is executed before the before_start method, allowing users to perform any necessary setup before the spider begins its crawling process.

It also creates all the necessary files, such as uuid_map.json, used to store the different states of the spiders.

before_start hook

The before_start method is a hook provided by the SiteCrawler class in the Kryptone framework. It is designed to execute preparatory steps just before the spider begins visiting a URL. Users can override this method in their custom spider classes to add their own preprocessing steps or to extend the functionality provided by the base class.

To ensure that any necessary setup defined in the SiteCrawler base class is executed along with any additional logic added in the subclass, it's recommended to call super().before_start() within the overridden before_start method.

Here's an example of how to override the before_start method in a custom spider class:

from kryptone.base import SiteCrawler

class MyWebScraper(SiteCrawler):
    start_url = 'http://example.com'

    def before_start(self):
        super().before_start() # Ensures base class logic is executed

Example Use Cases

  • Logging: Initialize logging configuration before starting the crawl.
  • Custom Headers: Set custom HTTP headers for requests.
  • Authentication: Perform authentication or login actions before accessing restricted pages.
  • Environment Setup: Configure the environment or load necessary resources before crawling begins.

Crawling hooks

These are actions that can be executed at different stages of the crawling process. They allow you to customize the behavior of your spider and perform specific actions at various points during the crawl.

The hooks are executed in this specific order: post_navigation_actions, current_page_actions, then before_next_page_actions.

Post navigation actions

Actions to be executed immediately after loading a page (e.g., clicking a cookie consent button)

Current page actions

Actions that execute on every visited page.

from kryptone.base import SiteCrawler

class MyWebscrapper(SiteCrawler):
    start_url = 'http://example.com'

    def current_page_actions(self, current_url, **kwargs):
        # Do something here
        pass

Before next page actions

Actions to be executed just before navigating to the next page, e.g. to run checks on the next URL to be visited.

from kryptone.base import SiteCrawler

class MyWebscrapper(SiteCrawler):
    start_url = 'http://example.com'

    def before_next_page_actions(self, current_url, **kwargs):
        # Run any checks on the next url to be visited here; the
        # signature shown mirrors current_page_actions (check the
        # base class for the exact arguments)
        pass

Conditional page actions

You can define actions that depend on the current URL, leveraging kryptone.utils.URL. See utilities.

Performance monitoring

When the spider runs in crawl mode (Meta.crawl is True), it writes a performance file to the media folder. This file records the time taken to crawl each URL, the number of requests made, and other metrics useful for monitoring the spider.

Seen URLs

These are all the URLs the spider has seen on every page it has visited.

Visited URLs

These are all the URLs that have actually been visited; they are a subset of the seen URLs.

Completion percentage

The completion percentage is calculated based on the number of visited URLs compared to the total number of URLs in the start_urls list.

This value is accessible via the calculate_completion_percentage cached property.
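The underlying calculation can be sketched as follows (a simplified stand-in for the cached property, guarding against an empty start_urls list):

```python
def completion_percentage(visited_urls, start_urls):
    # Percentage of the start URLs that have been visited
    if not start_urls:
        return 0.0
    return len(visited_urls) / len(start_urls) * 100
```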

Saving data

The data collected by the spider is stored on the class's DATA_CONTAINER attribute and saved in JSON format.

import dataclasses

from kryptone.base import SiteCrawler


@dataclasses.dataclass
class DataContainer:
    title: str = None
    url: str = None
    content: str = None


class ExampleSpider(SiteCrawler):
    model = DataContainer

    class Meta:
        start_urls = ['http://example.com']

    def current_page_actions(self, current_url, **kwargs):
        data = {
            'title': self.get_title(),
            'url': current_url,
            'content': self.get_content()
        }
        self.save_data(data)

You can save extracted data by calling save_data(data) where data is a dictionary containing the data you want to save.

You need to define a dataclass object in order to use the save_data method. The dataclass object can be defined in the models.py file of your project.

The dataclass should contain the fields you want to save; the save_data method will automatically map the data to those fields.

After data save hook

The after_data_save method is a hook that is called after data has been saved.

This allows you to perform additional actions or modifications after the data has been stored.

Downloading images and content

The download_images method allows you to download images from a given URL and save them in the media folder of your project.

The method takes a list of image URLs and downloads them to the specified directory.

from kryptone.base import SiteCrawler

class ExampleSpider(SiteCrawler):
    class Meta:
        start_urls = ['http://example.com']

    def current_page_actions(self, current_url, **kwargs):
        urls = ['http://example.com/image.jpg']
        self.download_images(urls)

Adding more URLs

While in crawl mode, the spider adds every URL it sees on the page. However, if you need to add a specific set of URLs to the current list of URLs to visit, you can call add_urls.
