This release is dedicated to Peter Porker of Earth-8311, an innocent pig raised by animal scientist May Porker. After a freak accident with the world's first atomic powered hairdryer, Peter was bitten by the scientist and transformed into a crime-fighting superhero pig.
New Additions
- Custom queries and multi-query reports can be defined in the Spidergram config files; Spidergram now ships with a handful of simple queries and an overview report as part of its core configuration.
- Spidergram can run an Axe Accessibility Report on every page as it crawls a site; this behavior can be turned on and off via the `spider.auditAccessibility` config property.
- Spidergram can now save cookies, performance data, and remote API requests made during page load using the `config.spider.saveCookies`, `.savePerformance`, and `.saveXhr` config properties.
- Spidergram can identify and catalog design patterns during the post-crawl page analysis process; pattern definitions can also include rules for extracting pattern properties like a card's title and CTA link.
- Resources with attached downloads can be processed using file parsing plugins; Spidergram 0.10.0 comes with support for PDF and .docx content and metadata, image EXIF metadata, and audio/video metadata in a variety of formats.
- The `config.spider.seed` setting lets you set one or more URLs as the default starting points for crawling.
- For large crawls, an experimental `config.offloadBodyHtml` settings flag has been added to Spidergram's global configuration. When it's set to `'db'`, all body HTML is stored in a dedicated key-value collection rather than the `resources` collection. On sites with many large pages (50k+ pages of 500k+ of HTML each), this can significantly improve the speed of filtering, queries, and reporting.
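As a rough illustration, the new settings listed above might sit together in a config object along these lines. The interfaces here are hypothetical stand-ins that mirror only the property names mentioned in these notes; the boolean value types and the array form of `seed` are assumptions, not Spidergram's actual schema.

```typescript
// Hypothetical types for illustration only. They mirror the property names
// from the release notes; the value types (other than 'db') are assumptions.
interface SpiderSettingsSketch {
  seed?: string | string[];      // default starting URL(s) for a crawl
  auditAccessibility?: boolean;  // run the Axe accessibility report per page
  saveCookies?: boolean;         // keep cookies set during page load
  savePerformance?: boolean;     // keep page-load performance data
  saveXhr?: boolean;             // keep remote API requests made by the page
}

interface ConfigSketch {
  offloadBodyHtml?: 'db';        // experimental: store body HTML separately
  spider?: SpiderSettingsSketch;
}

const config: ConfigSketch = {
  offloadBodyHtml: 'db',
  spider: {
    seed: ['https://example.com'],
    auditAccessibility: true,
    saveCookies: true,
    savePerformance: true,
    saveXhr: true,
  },
};
```

Check the configuration documentation for the accepted values of each property before relying on this shape.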
Changes
- Spidergram's CLI commands have been overhauled; vestigial commands from the 0.5.0 era have been removed and replaced. Of particular interest:
  - `spidergram status` summarizes the current config and DB state.
  - `spidergram init` generates a fresh configuration file in the current directory.
  - `spidergram ping` tests a remote URL using the current analysis settings.
  - `spidergram query` displays and saves filtered snapshots of the saved crawl graph.
  - `spidergram report` outputs a collection of query results as a combined workbook or JSON file.
  - `spidergram go` crawls one or more URLs, analyzes the crawled files, and generates a report in a single step.
  - `spidergram url test` tests a URL against the current normalizer and filter settings.
  - `spidergram url tree` replaces the old `urls` command for building site hierarchies.
- CLI consistency is significantly improved. For example:
  - `analyze`, `query`, `report`, and `url tree` all support the same `--filter` syntax for controlling which records are loaded from the database.
Fixes and under-the-hood improvements
- URL matching and filtering has been smoothed out, and a host of tests have been added to ensure things stay solid. Previously, filter strings were treated as globs matched against the entire URL. Now, `{ property: 'hostname', glob: '*.foo.com' }` objects can be used to explicitly specify glob or regex matches against individual URL components.
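To make the difference between whole-URL and per-component matching concrete, here is a minimal sketch. It is not Spidergram's matcher: `globToRegExp` and `matchesFilter` are hypothetical helpers, the glob handling supports only `*`, and the set of allowed `property` names is an assumption based on the standard `URL` object.

```typescript
// Illustrative only: a simplified stand-in, not Spidergram's implementation.
type UrlProperty = 'href' | 'protocol' | 'hostname' | 'pathname' | 'search' | 'hash';

interface ComponentFilter {
  property: UrlProperty; // which part of the parsed URL to test
  glob: string;          // '*' matches any run of characters
}

// Convert a simple glob into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters except '*', then expand '*' to '.*'.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

function matchesFilter(url: string, filter: string | ComponentFilter): boolean {
  const parsed = new URL(url);
  if (typeof filter === 'string') {
    // Whole-URL behavior: the glob must match the entire URL string.
    return globToRegExp(filter).test(parsed.href);
  }
  // Component behavior: the glob is tested against one URL property.
  return globToRegExp(filter.glob).test(parsed[filter.property]);
}

const pageUrl = 'https://www.foo.com/page';
const asWholeUrl = matchesFilter(pageUrl, '*.foo.com');
const asHostname = matchesFilter(pageUrl, { property: 'hostname', glob: '*.foo.com' });
```

In this sketch `asWholeUrl` is `false`, because the path `/page` follows the hostname, while `asHostname` is `true`. Targeting a single component removes that ambiguity.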