Martin Olveyra edited this page Oct 28, 2021 · 24 revisions

Tutorial

Introduction

What is shub-workflow

This library was originally inspired by tools like Apache Airflow. Its original purpose was to run workflows on ScrapyCloud and take full advantage of the platform, without relying on the external components required by Airflow or similar tools.

Over time it evolved into much more than that. Currently, shub-workflow is a suite of classes for defining and controlling simple and complex workflows of spiders and scripts running on Zyte's ScrapyCloud platform. It also provides additional tools for specific, frequently needed tasks during workflows, such as data delivery, job cloning, and S3 and GCS storage, as well as base classes for defining scripts that perform custom tasks in the context of a workflow.

Many of these additional components come from harvesting, adapting and generalizing code seen in many different projects developed by many people. The net result is a library that gathers good practices and ideas from many contributors, with the aim of promoting their standardization.

A couple of related libraries frequently work together with shub-workflow, because Scrapy spider workflows usually rely on Hubstorage Crawl Frontier (HCF) capabilities:

Moreover, hcf-backend provides a crawl manager subclassed from the shub-workflow base crawl manager class. It facilitates the scheduling of consumer spiders (spiders that consume requests from a frontier) and can be one task of a workflow. This tutorial also gives examples of their usage.

However, workflows defined with shub-workflow are not limited to HCF. Any storage technology can be used and mixed. In practice, the library is used to coordinate workflow pipelines of spiders and post-processing scripts running on ScrapyCloud, using storage technologies like S3 or GCS for massive data exchange between them. The library also provides utilities for working conveniently with those technologies in the context of the workflow pipelines built with it.

Note: This tutorial assumes appropriate knowledge of the ScrapyCloud platform: how to use it, how to deploy code and scripts on it, etc.

Installation

pip install shub-workflow
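
After running the command above, it can be useful to confirm that the package is visible to the Python environment you intend to use. The following is a minimal sketch that relies only on the standard library; the helper name `installed_version` is hypothetical and not part of shub-workflow, and the only assumption taken from this page is the distribution name `shub-workflow`.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "shub-workflow") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only queries package metadata
    and does not import or run any shub-workflow code.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("shub-workflow is not installed in this environment")
    else:
        print(f"shub-workflow {version} is installed")
```

Checking metadata rather than importing the package keeps the check free of side effects, which helps when the same environment is later deployed to ScrapyCloud.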

Next Chapter: Crawl Managers
