CachedResourceFilter

The CachedResourceFilter is a prefetch filter that allows you to skip downloading resources that are already cached and younger than a specified maximum age. This is useful for incremental crawls where you want to avoid re-downloading recently fetched content.

Features

Skip downloading resources that are already cached on disk
Configurable maximum age for cached resources
Works with file-based persistence handlers
Share cache across multiple spider runs by using the same spider ID

Usage

Basic Example

use VDB\Spider\Spider;
use VDB\Spider\Filter\Prefetch\CachedResourceFilter;
use VDB\Spider\PersistenceHandler\FileSerializedResourcePersistenceHandler;

// Use a fixed spider ID to share cache across runs
$spiderId = 'my-spider-cache';
$spider = new Spider('http://example.com', null, null, null, $spiderId);

// Set up file persistence
$resultsPath = __DIR__ . '/cache';
$spider->getDownloader()->setPersistenceHandler(
    new FileSerializedResourcePersistenceHandler($resultsPath)
);

// Add the cache filter with a 1-hour max age
$maxAgeSeconds = 3600; // 1 hour
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, $maxAgeSeconds);
$spider->getDiscovererSet()->addFilter($cacheFilter);

// Crawl - cached resources within 1 hour will be skipped
$spider->crawl();

Configuration Options

Constructor Parameters

$basePath (string): The base directory where spider results are stored
$spiderId (string): The spider ID used for the cache directory (must match the persistence handler's spider ID)
$maxAgeSeconds (int): Maximum age in seconds for cached resources. Set to 0 to always use cache regardless of age.

Max Age Behavior

The $maxAgeSeconds parameter controls how the filter determines if a cached resource is "fresh":

$maxAgeSeconds > 0: Only skip downloading if the cached file's modification time is less than $maxAgeSeconds seconds old
$maxAgeSeconds = 0: Always skip downloading if the file exists in cache, regardless of age (useful for archival crawls)

Examples

Always use cache (archival mode)

// Never re-download cached resources
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, 0);
$spider->getDiscovererSet()->addFilter($cacheFilter);

Daily incremental crawls

// Re-download resources older than 24 hours
$maxAgeSeconds = 86400; // 24 hours
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, $maxAgeSeconds);
$spider->getDiscovererSet()->addFilter($cacheFilter);

Hourly updates

// Re-download resources older than 1 hour
$maxAgeSeconds = 3600; // 1 hour
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, $maxAgeSeconds);
$spider->getDiscovererSet()->addFilter($cacheFilter);

Important Notes

Spider ID Consistency

For the cache to work across multiple runs, you must use the same spider ID:

// ✓ Correct: Fixed spider ID for cache sharing
$spiderId = 'my-consistent-id';
$spider = new Spider($seed, null, null, null, $spiderId);
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, $maxAge);

// ✗ Wrong: Auto-generated spider ID (different each run)
$spider = new Spider($seed); // Spider ID is random each time
$cacheFilter = new CachedResourceFilter($resultsPath, 'some-other-id', $maxAge); // Won't find cache from previous run

Persistence Handler Compatibility

The cache filter is designed to work with file-based persistence handlers:

FileSerializedResourcePersistenceHandler ✓
FilePersistenceHandler ✓
MemoryPersistenceHandler ✗ (stores in memory, not disk)

Cache Directory Structure

The cache filter expects the same directory structure as the file persistence handlers:

{basePath}/{spiderId}/{hostname}/{path}/{encoded-filename}

For example:

cache/my-spider-id/example.com/path/to/page.html

How It Works

When a URI is discovered, the cache filter checks if a corresponding file exists in the cache directory
If the file exists, it checks the file's modification time
If the file is younger than $maxAgeSeconds (or $maxAgeSeconds is 0), the filter returns true to skip downloading
If the file doesn't exist or is too old, the filter returns false to allow downloading

Complete Example

See example/example_cache.php for a complete working example.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CachedResourceFilter

Features

Usage

Basic Example

Configuration Options

Constructor Parameters

Max Age Behavior

Examples

Always use cache (archival mode)

Daily incremental crawls

Hourly updates

Important Notes

Spider ID Consistency

Persistence Handler Compatibility

Cache Directory Structure

How It Works

Complete Example

FilesExpand file tree

CachedResourceFilter.md

Latest commit

History

CachedResourceFilter.md

File metadata and controls

CachedResourceFilter

Features

Usage

Basic Example

Configuration Options

Constructor Parameters

Max Age Behavior

Examples

Always use cache (archival mode)

Daily incremental crawls

Hourly updates

Important Notes

Spider ID Consistency

Persistence Handler Compatibility

Cache Directory Structure

How It Works

Complete Example