The CachedResourceFilter is a prefetch filter that allows you to skip downloading resources that are already cached and younger than a specified maximum age. This is useful for incremental crawls where you want to avoid re-downloading recently fetched content.
- Skip downloading resources that are already cached on disk
- Configurable maximum age for cached resources
- Works with file-based persistence handlers
- Share cache across multiple spider runs by using the same spider ID
use VDB\Spider\Spider;
use VDB\Spider\Filter\Prefetch\CachedResourceFilter;
use VDB\Spider\PersistenceHandler\FileSerializedResourcePersistenceHandler;
// Use a fixed spider ID to share cache across runs
$spiderId = 'my-spider-cache';
$spider = new Spider('http://example.com', null, null, null, $spiderId);
// Set up file persistence
$resultsPath = __DIR__ . '/cache';
$spider->getDownloader()->setPersistenceHandler(
new FileSerializedResourcePersistenceHandler($resultsPath)
);
// Add the cache filter with a 1-hour max age
$maxAgeSeconds = 3600; // 1 hour
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, $maxAgeSeconds);
$spider->getDiscovererSet()->addFilter($cacheFilter);
// Crawl - cached resources within 1 hour will be skipped
$spider->crawl();$basePath(string): The base directory where spider results are stored$spiderId(string): The spider ID used for the cache directory (must match the persistence handler's spider ID)$maxAgeSeconds(int): Maximum age in seconds for cached resources. Set to0to always use cache regardless of age.
The $maxAgeSeconds parameter controls how the filter determines if a cached resource is "fresh":
$maxAgeSeconds > 0: Only skip downloading if the cached file's modification time is less than$maxAgeSecondsseconds old$maxAgeSeconds = 0: Always skip downloading if the file exists in cache, regardless of age (useful for archival crawls)
// Never re-download cached resources
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, 0);
$spider->getDiscovererSet()->addFilter($cacheFilter);// Re-download resources older than 24 hours
$maxAgeSeconds = 86400; // 24 hours
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, $maxAgeSeconds);
$spider->getDiscovererSet()->addFilter($cacheFilter);// Re-download resources older than 1 hour
$maxAgeSeconds = 3600; // 1 hour
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, $maxAgeSeconds);
$spider->getDiscovererSet()->addFilter($cacheFilter);For the cache to work across multiple runs, you must use the same spider ID:
// ✓ Correct: Fixed spider ID for cache sharing
$spiderId = 'my-consistent-id';
$spider = new Spider($seed, null, null, null, $spiderId);
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, $maxAge);
// ✗ Wrong: Auto-generated spider ID (different each run)
$spider = new Spider($seed); // Spider ID is random each time
$cacheFilter = new CachedResourceFilter($resultsPath, 'some-other-id', $maxAge); // Won't find cache from previous runThe cache filter is designed to work with file-based persistence handlers:
FileSerializedResourcePersistenceHandler✓FilePersistenceHandler✓MemoryPersistenceHandler✗ (stores in memory, not disk)
The cache filter expects the same directory structure as the file persistence handlers:
{basePath}/{spiderId}/{hostname}/{path}/{encoded-filename}
For example:
cache/my-spider-id/example.com/path/to/page.html
- When a URI is discovered, the cache filter checks if a corresponding file exists in the cache directory
- If the file exists, it checks the file's modification time
- If the file is younger than
$maxAgeSeconds(or$maxAgeSecondsis 0), the filter returnstrueto skip downloading - If the file doesn't exist or is too old, the filter returns
falseto allow downloading
See example/example_cache.php for a complete working example.