An asyncio-style web scraping framework inspired by Scrapy, powered by
curl_cffi.
scrapy_cffi is a lightweight Python crawler framework that mimics the Scrapy architecture while replacing Twisted with curl_cffi as the underlying HTTP/WebSocket client.
It is designed to be efficient, modular, and suitable for both simple tasks and large-scale distributed crawlers.
- Scrapy-style architecture: spiders, items, interceptors, pipelines, signals
- Fully asyncio-based engine for maximum concurrency
- HTTP & WebSocket support: built-in asynchronous clients
- Flexible DB integration: Redis, MySQL, MongoDB with async retry & reconnect
- Message queue support: RabbitMQ & Kafka
- Configurable deployment: settings system supporting `.env` files, single-instance, cluster mode, and sentinel mode
- Lightweight middleware & interceptor system for easy extensions
- High-performance C-extension hooks for CPU-intensive tasks
- Optional Redis-compatible scheduler for distributed crawling
- Designed for high-concurrency, high-availability crawling
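The fully asyncio-based engine dispatches requests concurrently through worker tasks and feeds responses into pipeline steps. The sketch below illustrates that pattern with plain stdlib asyncio only; it is not scrapy_cffi's internal code, and `fetch` is a stand-in for the real curl_cffi client:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for an HTTP call; in scrapy_cffi the real request
    # would go through the curl_cffi-backed async client.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def worker(queue: asyncio.Queue, results: list) -> None:
    # Each worker pulls URLs from the shared queue until cancelled.
    while True:
        url = await queue.get()
        try:
            body = await fetch(url)
            results.append((url, len(body)))  # "pipeline" step
        finally:
            queue.task_done()

async def crawl(urls: list, concurrency: int = 4) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    for url in urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(concurrency)]
    await queue.join()   # wait until every queued URL is processed
    for w in workers:
        w.cancel()       # workers loop forever; stop them once done
    return results

if __name__ == "__main__":
    out = asyncio.run(crawl([f"https://example.com/{i}" for i in range(8)]))
    print(len(out))
```

This is the general engine shape (queue, bounded worker pool, pipeline hand-off); scrapy_cffi layers interceptors, signals, and schedulers on top of it.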
```bash
pip install scrapy_cffi
```

Or install from source:

```bash
git clone https://github.com/aFunnyStrange/scrapy_cffi.git
cd scrapy_cffi
pip install -e .
```

Quick start:

```bash
scrapy-cffi startproject <project_name>
cd <project_name>
scrapy-cffi genspider <spider_name> <domain>
python runner.py
```

Notes:
The CLI command is `scrapy_cffi` in versions ≤ 0.1.4 and `scrapy-cffi` in versions > 0.1.4 for improved usability.
Starting from scrapy-cffi >= 0.2.5, `RedisScheduler` and `RabbitMqScheduler` no longer terminate automatically when the queue is empty. For finite/terminable spiders, use `SCHEDULER_LOOP_END` to specify the number of scheduler loops before automatic exit. For continuous-listening spiders (`RedisSpider`, `RabbitMqSpider`, or custom persistent spiders), leave `SCHEDULER_LOOP_END` as `None`. This change only affects automatic termination; task scheduling remains fully functional.
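For example, a spider that should stop once its seed queue drains can cap the scheduler loops in settings.py; only the setting name comes from the note above, and the value 3 is illustrative:

```python
# settings.py (sketch)
# Finite spider: allow 3 scheduler loops on an empty queue,
# then exit automatically. The value 3 is illustrative.
SCHEDULER_LOOP_END = 3

# Continuous-listening spiders (RedisSpider, RabbitMqSpider, or
# custom persistent spiders): keep the default of None so the
# scheduler never self-terminates.
# SCHEDULER_LOOP_END = None
```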
scrapy_cffi now fully supports a flexible settings system:

- Load configuration from Python files or `.env` files
- Choose between single-instance, cluster, or sentinel mode
- Configure databases, message queues, and concurrency limits in one place
- Seamless integration with async Redis/MySQL/MongoDB managers
Example settings.py snippet:

```python
settings.REDIS_INFO.MODE = "sentinel"
settings.REDIS_INFO.SENTINELS = [  # ports are integers
    ("<sentinel_host1>", <sentinel_port1>),
    ("<sentinel_host2>", <sentinel_port2>),
    ("<sentinel_host3>", <sentinel_port3>),
]
settings.REDIS_INFO.MASTER_NAME = "<master_name>"
settings.REDIS_INFO.SENTINEL_OVERRIDE_MASTER = ("<master_host>", <master_port>)
```

Full technical documentation and module-level guides are available in the docs/ directory.
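A single-instance deployment would configure the same settings object more simply. Note that the `MODE` value and the `HOST`/`PORT` field names below are assumptions mirroring the sentinel fields, not documented names:

```python
# settings.py (sketch; "single" and HOST/PORT are assumed names)
settings.REDIS_INFO.MODE = "single"
settings.REDIS_INFO.HOST = "<redis_host>"
settings.REDIS_INFO.PORT = 6379
```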
BSD 3-Clause License. See LICENSE for details.
Inspired by the challenges of async Python crawling:

- Blocking requests and slow DB integration
- Complex deployment for distributed crawlers
- Need for fully concurrent HTTP & WebSocket requests

scrapy_cffi addresses these with a modular, high-performance framework that is async-first, extensible, and deployment-ready.