All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Potential issues found with PHPStan 2 on level 8.
- Removed the overriding `validateAndSanitizeInput()` method from the `Paginate` HTTP step to ensure features like `staticUrl()` and `useInputKeyAsUrl()` work correctly.
- The `Paginate` HTTP step now also supports receiving an array of URLs, initiating pagination separately for each one.
- The `Crwlr\Crawler\Steps\Loading\Http\Paginate` class. It shall be removed and its behavior implemented in the `Http` class directly, in the next major version.
- An issue in the `SimpleWebsitePaginator` when used with stop rules.
- Issues with passing cookies from the cookie jar to the headless browser when using the `useBrowser()` method on `Http` steps, in cases where the loader wasn’t globally configured to use the browser for all requests.
- The `Result::toArray()` method now converts all objects contained in the Result array (at any level of the array) to arrays. Also, if the only element in a result array has an autogenerated key containing "unnamed", but the value is itself an associative array with string keys, the method returns only that child array.
- An issue that occurred when a step uses the `PreStepInvocationLogger`. As refiners also use the logger, a newer logger (replacing the `PreStepInvocationLogger`) is now also passed to all registered refiners of a step.
- Enable applying refiners to output properties with array values. E.g., if a step outputs an array of URLs (`['https://...', 'https://...']`), a `UrlRefiner` will be applied to all those URLs.
- Dynamically building request URLs from extracted data: `Http` steps now have a new `staticUrl()` method, and you can use variables within that static URL - as well as in request headers and the body - like `https://www.example.com/foo/[crwl:some_extracted_property]`. These variables are replaced with the corresponding properties from input data (also works with kept data).
- New Refiners: `DateTimeRefiner::reformat('Y-m-d H:i:s')` to reformat a date time string to a different format. It tries to automatically recognize the input format; if that does not work, you can provide an input format as the second argument. `HtmlRefiner::remove('#foo')` to remove nodes matching the given selector from selected HTML.
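A minimal sketch of the new `staticUrl()` variables (the property name `some_extracted_property` and the surrounding crawler setup are assumptions for illustration):

```php
use Crwlr\Crawler\Steps\Loading\Http;

// Assume a previous step yields outputs containing a property named
// "some_extracted_property" (e.g. extracted from HTML, or kept data).
// The [crwl:...] placeholder is replaced with that property's value
// from the input data before the request is sent.
$step = Http::get()
    ->staticUrl('https://www.example.com/foo/[crwl:some_extracted_property]');
```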
- Steps that produce multiple outputs per input can now group them per input by calling the new `Step::oneOutputPerInput()` method.
- When feeding an `Http` step with a string that is not a valid URL (e.g. `https://`), the exception when trying to parse it as a URL is caught, and an error is logged.
- As XML parsing errors sometimes occur because of characters that aren't valid within XML documents, the library now catches XML parsing errors, tries to find and replace invalid characters (with transliterations or HTML entities) and retries parsing the document. Works best when you additionally install the `voku/portable-ascii` composer package.
- When providing an empty base selector to an `Html` step (`Html::each('')`, `Html::first('')`, `Html::last('')`), it won't fail with an error, but instead logs a warning that it most likely doesn't make sense.
- The `Step::keep()` methods now also work when applied to child steps within a group step.
- Issue when using `Http::get()->useBrowser()->postBrowserNavigateHook()`. Previously, when the loader was configured to use the HTTP client, the post browser navigate hook was not actually set, because of an issue with the order in which things happened internally.
- Since only GET requests can be executed when using the Chrome browser for loading:
  - The loader now automatically switches to the HTTP client for POST, PUT, PATCH, and DELETE requests and logs a warning.
  - A warning is logged when attempting to use "Post Browser Navigate Hooks" with POST, PUT, PATCH, or DELETE requests.
  - Consequently, the `useBrowser()` method, introduced in v3.4.0, is also limited to GET requests.
- Two new methods in the base class of all `Http` steps:
  - `skipCache()` - skips the cache for a specific loading step, even when the loader is generally configured to use one.
  - `useBrowser()` - switches the loader to use a (headless) Chrome browser for loading calls in a specific step and then reverts the loader to its previous setting.
- Introduced the new `BrowserAction::screenshot()` post browser navigate hook. It accepts an instance of the new `ScreenshotConfig` class, allowing you to configure various options (see the methods of `ScreenshotConfig`). If successful, the screenshot file paths are included in the `RespondedRequest` output object of the `Http` step.
- New `BrowserAction`s to use with the `postBrowserNavigateHook()` method:
  - `BrowserAction::clickInsideShadowDom()`
  - `BrowserAction::moveMouseToElement()`
  - `BrowserAction::moveMouseToPosition()`
  - `BrowserAction::scrollDown()`
  - `BrowserAction::scrollUp()`
  - `BrowserAction::typeText()`
  - `BrowserAction::waitForReload()`
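As a hedged sketch (assuming `postBrowserNavigateHook()` can be called multiple times to register several actions, and using a hypothetical `#load-more` selector), the actions can be combined like this:

```php
use Crwlr\Crawler\Steps\Loading\Http;
// (use statement for BrowserAction omitted; import it from its
// namespace in this package.)

// Hypothetical example: click a "load more" button on the page and
// wait for the reload that the click triggers.
$step = Http::get()
    ->useBrowser()
    ->postBrowserNavigateHook(BrowserAction::clickElement('#load-more'))
    ->postBrowserNavigateHook(BrowserAction::waitForReload());
```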
- A new method in `HeadlessBrowserLoaderHelper` to include the HTML content of shadow DOM elements in the returned HTML. Use it like this: `$crawler->getLoader()->browser()->includeShadowElementsInHtml()`.
- The `BrowserAction::clickElement()` action now automatically waits for an element matching the selector to be rendered before performing the click. This means you don't need to put a `BrowserAction::waitUntilDocumentContainsElement()` before it. This works the same in the new `BrowserAction::clickInsideShadowDom()` and `BrowserAction::moveMouseToElement()` actions.
- `BrowserAction::clickElementAndWaitForReload()` and `BrowserAction::evaluateAndWaitForReload()`. As a replacement, please use `BrowserAction::clickElement()` or `BrowserAction::evaluate()` and `BrowserAction::waitForReload()` separately.
- When a child step is nested in the `extract()` method of an `Html` or `Xml` step, and does not use `each()` as the base, the extracted value is an array with the keys defined in the `extract()` call, rather than an array of such arrays as it would be with `each()` as base.
- Trying to load a relative reference URI (no scheme and host/authority, only a path) via the `HttpLoader` now immediately logs an error (or throws when `loadOrFail()` is used) instead of trying to actually load it.
- Fix deprecation warning triggered in the `DomQuery` class when trying to get the value of an HTML/XML attribute that does not exist on the element.
- Warnings about loader hooks being called multiple times when using a `BotUserAgent` (and therefore loading and respecting the robots.txt file), or when using the `Http::stopOnErrorResponse()` method.
- Reuse previously opened page when using the (headless) Chrome browser, instead of opening a new page for each request.
- `RespondedRequest::isServedFromCache()` to determine whether a response was served from cache or actually loaded.
- Another improvement for getting XML source when using the browser, in cases where Chrome doesn't identify the response as an XML document (even though a Content-Type header is sent).
- `HttpLoader::dontUseCookies()` now also works when using the Chrome browser. Cookies are cleared before every request.
- Further improve getting the raw response body from non-HTML documents via Chrome browser.
- When loading a non-HTML document (e.g., XML) via the Chrome browser, the library now retrieves the original source. Previously, it returned the outerHTML of the rendered document, which wrapped the content in an HTML structure.
- When the `validateAndSanitize()` method of a step throws an `InvalidArgumentException`, the exception is now caught and logged, and the step is not invoked with the invalid input. This improves fault tolerance: feeding a step with one invalid input shouldn't cause the whole crawler run to fail. Exceptions other than `InvalidArgumentException` remain uncaught.
- New method `HeadlessBrowserLoaderHelper::setPageInitScript()` (`$crawler->getLoader()->browser()->setPageInitScript()`) to provide JavaScript code that is executed on every new browser page before navigating anywhere.
- New method `HeadlessBrowserLoaderHelper::useNativeUserAgent()` (`$crawler->getLoader()->browser()->useNativeUserAgent()`) to allow using the native `User-Agent` that your Chrome browser sends by default.
- Minor improvement for the `DomQuery` (base for `Dom::cssSelector()` and `Dom::xPath()`): enable providing an empty string as selector, to simply get the node that the selector is applied to.
- Improved fix for non-UTF-8 characters in HTML documents declared as UTF-8.
- When the new PHP 8.4 DOM API is used, and HTML declared as UTF-8 contains non-UTF-8 compatible characters, it does not replace them with a � character, but instead removes them. This behaviour is consistent with the data returned by the Symfony DomCrawler.
- Removed deprecations for all XPath functionality (`Dom::xPath()`, the `XPathQuery` class and `Node::queryXPath()`), because it's still available with the new DOM API in PHP 8.4.
The primary change in version 3.0.0 is that the library now leverages PHP 8.4’s new DOM API when used in an environment with PHP >= 8.4. To maintain compatibility with PHP < 8.4, an abstraction layer has been implemented. This layer dynamically uses either the Symfony DomCrawler component or the new DOM API, depending on the PHP version.
Since no direct interaction with an instance of the Symfony DomCrawler library was required at the step level provided by the library, it is highly likely that you won’t need to make any changes to your code to upgrade to v3. To ensure a smooth transition, please review the points under “Changed.”
- BREAKING: The `DomQuery::innerText()` method (a.k.a. `Dom::cssSelector('...')->innerText()`) has been removed. `innerText` exists only in the Symfony DomCrawler component, and its usefulness is questionable. If you still require this variant of the DOM element text, please let us know or create a pull request yourself. Thank you!
- BREAKING: The `DomQueryInterface` was removed. As the `DomQuery` class offers a lot more functionality than the interface defines, the purpose of the interface was questionable. Please use the abstract `DomQuery` class instead. This also means that some method signatures, type hinting the interface, have changed. Look for occurrences of `DomQueryInterface` and replace them.
- BREAKING: The visibility of the `DomQuery::filter()` method was changed from public to protected. It is still needed in the `DomQuery` class, but outside of it, it is probably better and easier to directly use the new DOM abstraction (see the `src/Steps/Dom` directory). If you are extending the `DomQuery` class (which is not recommended), be aware that the argument now takes a `Node` (from the new DOM abstraction) instead of a Symfony `Crawler`.
- BREAKING: The `Step::validateAndSanitizeToDomCrawlerInstance()` method was removed. Please use the `Step::validateAndSanitizeToHtmlDocumentInstance()` and `Step::validateAndSanitizeToXmlDocumentInstance()` methods instead.
- BREAKING: The second argument in `Closure`s passed to `Http::crawl()->customFilter()` has changed from an instance of the Symfony `Crawler` class to an `HtmlElement` instance from the new DOM abstraction (`Crwlr\Crawler\Steps\Dom\HtmlElement`).
- BREAKING: The `Filter` class was split into `AbstractFilter` (base class for actual filter classes) and `Filter`, only hosting the static functions for easy instantiation, because otherwise each filter class would also have all the static methods.
- BREAKING: Further, the signatures of some methods that are mainly here for internal usage have changed due to the new DOM abstraction:
  - The static `GetLink::isSpecialNonHttpLink()` method now needs an instance of `HtmlElement` instead of a Symfony `Crawler`.
  - `GetUrlsFromSitemap::fixUrlSetTag()` now takes an `XmlDocument` instead of a Symfony `Crawler`.
  - The `DomQuery::apply()` method now takes a `Node` instead of a Symfony `Crawler`.
- The static `Dom::xPath()` method and
- the `XPathQuery` class, as well as
- the new `Node::queryXPath()` method.
- New step output filter `Filter::arrayHasElement()`. When a step produces array output with a property being a numeric array, you can now filter outputs by checking if one element of that array property matches certain filter criteria. Example: the outputs look like `['foo' => 'bar', 'baz' => ['one', 'two', 'three']]`. You can filter all outputs where `baz` contains `two` like: `Filter::arrayHasElement()->where('baz', Filter::equal('two'))`.
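Put together, a hedged sketch of the example above (the `Filter` import path and the step's output filter method, assumed here to be `where()`, may differ):

```php
use Crwlr\Crawler\Steps\Filters\Filter;

// Outputs shaped like ['foo' => 'bar', 'baz' => ['one', 'two', 'three']]
// pass this filter only if some element of the "baz" array equals 'two'.
$filter = Filter::arrayHasElement()->where('baz', Filter::equal('two'));

// Attach it as an output filter on a step (method name assumed):
$step->where($filter);
```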
- Improvements for deprecations in PHP 8.4.
- Issue when converting cookie objects received from the chrome-php library.
- Cookies set during headless browser usage are now also added to the cookie jar, so when switching back to the (guzzle) HTTP client, those cookies are also sent.
- Don't call `Loader::afterLoad()` when `Loader::beforeLoad()` was not called before. This can potentially happen when an exception is thrown before the call to the `beforeLoad` hook, but it is caught and the `afterLoad` hook method is called anyway. As this most likely won't make sense to users, the `afterLoad` hook callback functions will just not be called in this case.
- The `Throttler` class now has protected methods `_internalTrackStartFor()`, `_requestToUrlWasStarted()` and `_internalTrackEndFor()`. When extending the `Throttler` class (be careful, that's actually not really recommended), they can be used to check if a request to a URL was actually started before.
- The new `postBrowserNavigateHook()` method in the `Http` step classes, which allows defining callback functions that are triggered after the headless browser navigated to the specified URL. They are called with the chrome-php `Page` object as argument, so you can interact with the page. Also, there is a new class `BrowserAction` providing some simple actions (like wait for element, click element, ...) as Closures via static methods. You can use it like `Http::get()->postBrowserNavigateHook(BrowserAction::clickElement('#element'))`.
- Issue with the `afterLoad` hook of the `HttpLoader`, introduced in v2. Calling the hook was commented out, which slipped through because the test case was faulty.
- BREAKING: Removed methods `BaseStep::addToResult()`, `BaseStep::addLaterToResult()`, `BaseStep::addsToOrCreatesResult()`, `BaseStep::createsResult()`, and `BaseStep::keepInputData()`. These methods were deprecated in v1.8.0 and should be replaced with `Step::keep()`, `Step::keepAs()`, `Step::keepFromInput()`, and `Step::keepInputAs()`.
- BREAKING: Added the following keep methods to the `StepInterface`: `StepInterface::keep()`, `StepInterface::keepAs()`, `StepInterface::keepFromInput()`, `StepInterface::keepInputAs()`, as well as `StepInterface::keepsAnything()`, `StepInterface::keepsAnythingFromInputData()` and `StepInterface::keepsAnythingFromOutputData()`. If you have a class that implements this interface without extending `Step` (or `BaseStep`), you will need to implement these methods yourself. However, it is strongly recommended to extend `Step` instead.
- BREAKING: With the removal of the `addToResult()` method, the library no longer uses `toArrayForAddToResult()` methods on output objects. Instead, please use `toArrayForResult()`. Consequently, `RespondedRequest::toArrayForAddToResult()` has been renamed to `RespondedRequest::toArrayForResult()`.
- BREAKING: Removed the `result` and `addLaterToResult` properties from `Io` objects (`Input` and `Output`). These properties were part of the `addToResult` feature and are now removed. Instead, use the `keep` property where kept data is added.
- BREAKING: The signature of the `Crawler::addStep()` method has changed. You can no longer provide a result key as the first parameter. Previously, this key was passed to the `Step::addToResult()` method internally. Now, please handle this call yourself.
- BREAKING: The return type of the `Crawler::loader()` method no longer allows `array`. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the new functionality to directly provide a custom loader to a step, described below. As part of this change, the `UnknownLoaderKeyException` was also removed as it is now obsolete. If you have any references to this class, please make sure to remove them.
- BREAKING: Refactored the abstract `LoadingStep` class to a trait and removed the `LoadingStepInterface`. Loading steps should now extend the `Step` class and use the trait. As multiple loaders are no longer supported, the `addLoader` method was renamed to `setLoader`. Similarly, the methods `useLoader()` and `usesLoader()` for selecting loaders by key are removed. Now, you can directly provide a different loader to a single step using the trait's new `withLoader()` method (e.g., `Http::get()->withLoader($loader)`). The trait now also uses phpdoc template tags for a generic loader type. You can define the loader type by putting `/** @use LoadingStep<MyLoader> */` above `use LoadingStep;` in your step class. Then your IDE and static analysis (if supported) will know what type of loader the trait methods return and accept.
- BREAKING: Removed the `PaginatorInterface` to allow for better extensibility. The old `Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator` class has also been removed. Please use the newer, improved version `Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator`. This newer version has also changed: the first argument `UriInterface $url` was removed from the `processLoaded()` method, as the URL is also part of the request (`Psr\Http\Message\RequestInterface`), which is now the first argument. Additionally, the default implementation of the `getNextRequest()` method was removed. Child implementations must define this method themselves. If your custom paginator still has a `getNextUrl()` method, note that it is no longer needed by the library and will not be called. The `getNextRequest()` method now fulfills its original purpose.
- BREAKING: Removed methods from `HttpLoader`:
  - `$loader->setHeadlessBrowserOptions()` => use `$loader->browser()->setOptions()` instead
  - `$loader->addHeadlessBrowserOptions()` => use `$loader->browser()->addOptions()` instead
  - `$loader->setChromeExecutable()` => use `$loader->browser()->setExecutable()` instead
  - `$loader->browserHelper()` => use `$loader->browser()` instead
- BREAKING: Removed method `RespondedRequest::cacheKeyFromRequest()`. Use `RequestKey::from()` instead.
- BREAKING: The `HttpLoader::retryCachedErrorResponses()` method now returns an instance of the new `Crwlr\Crawler\Loader\Http\Cache\RetryManager` class. This class provides the methods `only()` and `except()` to restrict retries to specific HTTP response status codes. Previously, this method returned the `HttpLoader` itself (`$this`), so if you're using it in a chain and calling other loader methods after it, you will need to refactor your code.
- BREAKING: Removed the `Microseconds` class from this package. It has been moved to the `crwlr/utils` package, which you can use instead.
- New methods `FileCache::prolong()` and `FileCache::prolongAll()` to allow prolonging the time to live for cached responses.
- The `maxOutputs()` method is now also available and working on `Group` steps.
- Improved warning messages for step validations that happen before running a crawler.
- A `PreRunValidationException`, thrown when the crawler finds a problem with the setup before actually running, is not only logged as an error via the logger, but also rethrown to the user. This way the user won't get the impression that the crawler ran successfully without looking at the log messages.
- URL refiners: `UrlRefiner::withScheme()`, `UrlRefiner::withHost()`, `UrlRefiner::withPort()`, `UrlRefiner::withoutPort()`, `UrlRefiner::withPath()`, `UrlRefiner::withQuery()`, `UrlRefiner::withoutQuery()`, `UrlRefiner::withFragment()` and `UrlRefiner::withoutFragment()`.
- New paginator stop rules `PaginatorStopRules::contains()` and `PaginatorStopRules::notContains()`.
- Static method `UserAgent::mozilla5CompatibleBrowser()` to get a `UserAgent` instance with the user agent string `Mozilla/5.0 (compatible)`, and also the new method `withMozilla5CompatibleUserAgent` in the `AnonymousHttpCrawlerBuilder` that you can use like this: `HttpCrawler::make()->withMozilla5CompatibleUserAgent()`.
- Prevent PHP warnings when an HTTP response includes a `Content-Type: application/x-gzip` header, but the content is not actually compressed. This issue also occurred with cached responses, because compressed content is decoded during caching. Upon retrieval from the cache, the header indicated compression, but the content was already decoded.
- When using `HttpLoader::cacheOnlyWhereUrl()` to restrict caching, the filter rule is not only applied when adding newly loaded responses to the cache, but also when using cached responses. Example: a response for `https://www.example.com/foo` is already available in the cache, but `$loader->cacheOnlyWhereUrl(Filter::urlPathStartsWith('/bar/'))` was called; the cached response is not used.
- Add `HttpLoader::browser()` as a replacement for `HttpLoader::browserHelper()` and deprecate the `browserHelper()` method. It's an alias, just because it reads a little better: `$loader->browser()->xyz()` vs. `$loader->browserHelper()->xyz()`. `HttpLoader::browserHelper()` will be removed in v2.0.
- Also deprecate `HttpLoader::setHeadlessBrowserOptions()`, `HttpLoader::addHeadlessBrowserOptions()` and `HttpLoader::setChromeExecutable()`. Use `$loader->browser()->setOptions()`, `$loader->browser()->addOptions()` and `$loader->browser()->setExecutable()` instead.
- Issue with setting the headless chrome executable, introduced in 1.9.0.
- Also add `HeadlessBrowserLoaderHelper::getTimeout()` to get the currently configured timeout value.
- New methods `HeadlessBrowserLoaderHelper::setTimeout()` and `HeadlessBrowserLoaderHelper::waitForNavigationEvent()` to allow defining the timeout for the headless Chrome in milliseconds (default 30000 = 30 seconds) and the navigation event (`load` (default), `DOMContentLoaded`, `firstMeaningfulPaint`, `networkIdle`, etc.) to wait for when loading a URL.
- New methods `Step::keep()` and `Step::keepAs()`, as well as `Step::keepFromInput()` and `Step::keepInputAs()`, as alternatives to `Step::addToResult()` (or `Step::addLaterToResult()`). The `keep()` method can be called without any argument to keep everything from the output data, with a string to keep a certain key, or with an array to keep a list of keys. If the step yields scalar value outputs (not an associative array or object with keys), you need to use the `keepAs()` method with the key you want the output value to have in the kept data. The methods `keepFromInput()` and `keepInputAs()` work the same, but use the input (not the output) that the step receives. They are most likely only needed with a first step, to keep data from initial inputs (or in a sub crawler, see below). Kept properties can also be accessed with the `Step::useInputKey()` method, so you can easily reuse properties from multiple steps ago as input.
- New method `Step::outputType()` with a default implementation returning `StepOutputType::Mixed`. Please consider implementing this method yourself in all your custom steps, because it is going to be required in v2 of the library. It allows detecting (potential) problems in crawling procedures immediately when starting a run, instead of failing after already running a while.
- New method `Step::subCrawlerFor()`, allowing to fill output properties from an actual full child crawling procedure. As the first argument, you give it a key from the step's output that the child crawler uses as input(s). As the second argument, you need to provide a `Closure` that receives a clone of the current `Crawler`, without steps and with initial inputs set from the current output. In the `Closure` you then define the crawling procedure by adding steps as you're used to, and return it. This allows achieving nested output data, scraped from different (sub-)pages, more flexibly and with less complication than with the usual linear crawling procedure and `Step::addToResult()`.
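A rough sketch of `Step::subCrawlerFor()` (the selectors, property names and extraction details are illustrative assumptions, not from this library's docs):

```php
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// For each listing item, the child crawler receives the "detailUrl"
// values as initial inputs, loads them, and the data kept in the
// child procedure ends up nested in this step's output.
$step = Html::each('#listing .item')
    ->extract(['title' => 'h3', 'detailUrl' => 'a'])
    ->subCrawlerFor('detailUrl', function (Crawler $crawler) {
        return $crawler
            ->addStep(Http::get())
            ->addStep(Html::first('article')->extract(['text' => 'p'])->keep());
    });
```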
- The `Step::addToResult()`, `Step::addLaterToResult()` and `Step::keepInputData()` methods. Instead, please use the new keep methods. This can cause some migration work for v2, because the add to result methods in particular are a pretty central functionality, but the new "keep" methodology (plus the new sub crawler feature) will make a lot of things easier and less complex, and the library will most likely work more efficiently in v2.
- When a cache file was generated with compression, and you're trying to read it with a `FileCache` instance without compression enabled, it also works. When unserializing the file content fails, it tries decoding the string first before unserializing it.
- When the `useInputKey()` method is used on a step and the defined key does not exist in the input, it logs a warning and does not invoke the step, instead of throwing an `Exception`.
- A PHP error that happened when the loader returns `null` for the initial request in the `Http::crawl()` step.
- Allow getting the whole decoded JSON as array with the new `Json::all()`, and also allow getting the whole decoded JSON when using `Json::get()`, inside a mapping, using either an empty string or `*` as target. Example: `Json::get(['all' => '*'])`. `*` only works when there is no key `*` in the decoded data.
- Make it work with responses loaded by a headless browser. If decoding the input string fails, it now checks if it could be HTML. If that's the case, it extracts the text content of the `<body>` and tries to decode this instead.
- When using `HttpLoader::cacheOnlyWhereUrl()` and a request was redirected (maybe even multiple times), previously all URLs in the chain had to match the filter rule. As this isn't really practicable, now only one of the URLs has to match the rule.
- Make the method `HttpLoader::addToCache()` public, so steps can update a cached response with an extended version.
- Enable dot notation in `Step::addToResult()`, so you can get data from nested output, like: `$step->addToResult(['url' => 'response.url', 'status' => 'response.status', 'foo' => 'bar'])`.
- When a step adds output properties to the result, and the output contains objects, it tries to serialize those objects to arrays by calling `__serialize()`. If you want an object to be serialized differently for that purpose, you can define a `toArrayForAddToResult()` method in that class. When that method exists, it's preferred over the `__serialize()` method.
- Implemented the above-mentioned `toArrayForAddToResult()` method in the `RespondedRequest` class, so on every step that somehow yields a `RespondedRequest` object, you can use the keys `url`, `uri`, `status`, `headers` and `body` with the `addToResult()` method. Previously this only worked for `Http` steps, because it defines output key aliases (`HttpBase::outputKeyAliases()`). Now, in combination with the ability to use dot notation when adding data to the result, if your custom step returns nested output like `['response' => RespondedRequest, 'foo' => 'bar']`, you can add response data to the result like this: `$step->addToResult(['url' => 'response.url', 'body' => 'response.body'])`.
- Improvement regarding the timing when a store (`Store` class instance) is called by the crawler with a final crawling result. When a crawling step initiates a crawling result (so, `addToResult()` was called on the step instance), the crawler has to wait for all child outputs (resulting from one step input) until it calls the store, because the child outputs can all add data to the same final result object. But previously this was not only the case for all child outputs starting from a step where `addToResult()` was called, but for all children of one initial crawler input. With this change, in a lot of cases, the store will be called earlier with finished `Result` objects, and memory usage will be lowered.
- Merge `HttpBaseLoader` back into `HttpLoader`. It's probably not a good idea to have multiple loaders - at least not multiple loaders just for HTTP. It should be enough to publicly expose the `HeadlessBrowserLoaderHelper` via `HttpLoader::browserHelper()` for the extension steps. But the `HttpBase` step is kept, to share the general HTTP functionality implemented there.
- Issue in the `GetUrlsFromSitemap` (`Sitemap::getUrlsFromSitemap()`) step when XML content has no line breaks.
- For being more flexible to build a separate headless browser loader (in an extension package), extract the most basic HTTP loader functionality to a new `HttpBaseLoader`, and important functionality for the headless browser loader to a new `HeadlessBrowserLoaderHelper`. Further, also share functionality from the `Http` steps via a new abstract `HttpBase` step. It's considered a fix, because there's no new functionality, just refactoring existing code for better extendability.
- The `DomQuery` class (parent of `CssSelector` (`Dom::cssSelector`) and `XPathQuery` (`Dom::xPath`)) has a new method `formattedText()` that uses the new crwlr/html-2-text package to convert the HTML to formatted plain text. You can also provide a customized instance of the `Html2Text` class to the `formattedText()` method.
- The `Http::crawl()` step won't yield a page again if a newly found URL responds with a redirect to a previously loaded URL.
- The `QueryParamsPaginator` can now also increase and decrease non-first-level query param values like `foo[bar][baz]=5` using dot notation: `QueryParamsPaginator::paramsInUrl()->increaseUsingDotNotation('foo.bar.baz', 5)`.
- The `FileCache` can now also read uncompressed cache files when compression is activated.
- Reset paginator state after finishing paginating for one base input, to enable paginating multiple listings of the same structure.
- Add a forgotten getter method to get the DOM query that is attached to an `InvalidDomQueryException` instance.
- When creating a `CssSelector` or `XPathQuery` instance with invalid selector/query syntax, an `InvalidDomQueryException` is now immediately thrown. This change is considered not only non-breaking, but actually a fix, because the `CssSelector` would otherwise throw an exception later when the `apply()` method is called, and the `XPathQuery` would silently return no result without notifying you of the invalid query, and generate a PHP warning.
- Support usage with the new Symfony major version v7.
- New methods `HttpLoader::useProxy()` and `HttpLoader::useRotatingProxies([...])` to define proxies that the loader shall use. They can be used with a guzzle HTTP client instance (default) and when the loader uses the headless Chrome browser. Using them when providing some other PSR-18 implementation will throw an exception.
- New `QueryParamsPaginator` to paginate by increasing and/or decreasing one or multiple query params, either in the URL or in the body of requests. Can be created via the static method `Crwlr\Crawler\Steps\Loading\Http\Paginator::queryParams()`.
- New method `stopWhen` in the new `Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator` class (for more info see the deprecation below). You can pass implementations of the new `StopRule` interface or custom closures to that method, and then, every time the Paginator receives a loaded response to process, those stop rules are called with the response. If any of the conditions of the stop rules is met, the Paginator stops paginating. Of course, a few stop rules to use with that new method were also added: `IsEmptyInHtml`, `IsEmptyInJson`, `IsEmptyInXml` and `IsEmptyResponse`, also available via static methods: `PaginatorStopRules::isEmptyInHtml()`, `PaginatorStopRules::isEmptyInJson()`, `PaginatorStopRules::isEmptyInXml()` and `PaginatorStopRules::isEmptyResponse()`.
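A hedged sketch combining the new paginator with a stop rule (the import paths, the `increase()` method name, the `paginate()` call and the `#items` selector are assumptions for illustration):

```php
use Crwlr\Crawler\Steps\Loading\Http;
// (use statements for QueryParamsPaginator and PaginatorStopRules
// omitted; import them from their namespaces in this package.)

// Increase the "page" URL query param on every request, and stop
// paginating as soon as the element matching #items is empty in
// a loaded response.
$step = Http::get()->paginate(
    QueryParamsPaginator::paramsInUrl()
        ->increase('page')
        ->stopWhen(PaginatorStopRules::isEmptyInHtml('#items')),
);
```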
- Deprecated the `Crwlr\Crawler\Steps\Loading\Http\PaginatorInterface` and the `Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator`. Instead, a new version of the `AbstractPaginator` was added as `Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator`, which can be used. Usually there shouldn't be a problem switching from the old to the new version. If you want to make your custom paginator implementation ready for v2 of the library, extend the new `AbstractPaginator` class, implement your own `getNextRequest` method (a new requirement, with a default implementation in the abstract class, which will be removed in v2), and check that properties and methods of your existing class don't collide with the new properties and methods in the abstract class.
- The `HttpLoader::load()` implementation won't throw any exception, because it shouldn't kill a crawler run. When you want any loading error to end the whole crawler execution, `HttpLoader::loadOrFail()` should be used. Also adapted the phpdoc in the `LoaderInterface`.
- Fix in the `HttpCrawl` (`Http::crawl()`) step: when a page contains a broken link that can't be resolved and throws an `Exception` from the URL library, ignore the link and log a warning message.
- Minor fix for merging HTTP headers when an `Http` step gets both statically defined headers and headers to use from array input.
- When a URL redirects, the `trackRequestEndFor()` method of the `HttpLoader`'s `Throttler` instance is called only once, at the end, and with the original request URL.
- New `onCacheHit` hook in the `Loader` class (in addition to `beforeLoad`, `onSuccess`, `onError` and `afterLoad`) that is called in the `HttpLoader` class when a response for a request was found in the cache.
- Moved the `Microseconds` value object class to the crwlr/utils package, as it is a very useful and universal tool. The class in this package still exists, but now just extends the class from the utils package and will be removed in v2. So, if you're using this class, please switch to the version from the utils package.
- Throttling now also works when using the headless browser.
- The `Http::crawl()` step, as well as the `Html::getLink()` and `Html::getLinks()` steps, now ignore links when the `href` attribute starts with `mailto:`, `tel:` or `javascript:`. For the crawl step such links obviously make no sense, but it's also considered a bugfix for the getLink(s) steps, because they are meant to deliver absolute HTTP URLs. If you want to get the values of such links, use the HTML data extraction step.
- The `Http::crawl()` step now also works with sitemaps as input URL, where the `<urlset>` tag contains attributes that would cause the symfony DomCrawler to not find any elements.
- Improved `Json` step: if the target of the "each" (like `Json::each('target', [...])`) does not exist in the input JSON data, the step yields nothing and logs a warning.
- Using the `only()` method of the `MetaData` (`Html::metaData()`) step class, the `title` property was always contained in the output, even if not listed in the `only` properties. This is fixed now.
- There was an issue when adding multiple associative arrays with the same key to a `Result` object: let's say you have a step producing array output like `['bar' => 'something', 'baz' => 'something else']`, and it (the whole array) shall be added to the result property `foo`. When the step produced multiple such array outputs, that led to a result like `['bar' => '...', 'baz' => '...', ['bar' => '...', 'baz' => '...'], ['bar' => '...', 'baz' => '...']]`. Now it's fixed to result in `[['bar' => '...', 'baz' => '...'], ['bar' => '...', 'baz' => '...'], ['bar' => '...', 'baz' => '...']]`.
- `Http` steps can now receive body and headers from input data (instead of statically defining them via arguments like `Http::method(headers: ...)`) using the new methods `useInputKeyAsBody(<key>)` and `useInputKeyAsHeader(<key>, <asHeader>)` or `useInputKeyAsHeaders(<key>)`. Further, when invoked with associative array input data, the step will by default use the value from `url` or `uri` for the request URL. If the input array contains the URL in a key with a different name, you can use the new `useInputKeyAsUrl(<key>)` method. That was basically already possible with the existing `useInputKey(<key>)` method, because the URL is the main input argument for the step. But if you want to use it in combination with the other new `useInputKeyAsXyz()` methods, you have to use `useInputKeyAsUrl()`, because using `useInputKey(<key>)` would invoke the whole step with that key only.
- `Crawler::runAndDump()` as a simple way to just run a crawler and dump all results, each as an array.
- `addToResult()` now also works with serializable objects.
- If you know certain keys that the output of a step will contain, you can now also define aliases for those keys, to be used with `addToResult()`. The output of an `Http` step (`RespondedRequest`) contains the keys `requestUri` and `effectiveUri`. The aliases `url` and `uri` refer to `effectiveUri`, so `addToResult(['url'])` will add the `effectiveUri` as `url` to the result object.
- The `GetLink` (`Html::getLink()`) and `GetLinks` (`Html::getLinks()`) steps, as well as the abstract `DomQuery` class (parent of `CssSelector` (/`Dom::cssSelector`) and `XPathQuery` (/`Dom::xPath`)), now have a method `withoutFragment()` to get links, respectively URLs, without their fragment part.
- The `HttpCrawl` step (`Http::crawl()`) has a new method `useCanonicalLinks()`. If you call it, the step will not yield responses whose canonical link URL was already yielded. And if it discovers a link, and some document pointing to that URL via canonical link was already loaded, it treats it as if it was already loaded. Further, this feature also sets the canonical link URL as the `effectiveUri` of the response.
- All filters can now be negated by calling the `negate()` method, so the `evaluate()` method will return the opposite bool value when called. The `negate()` method returns an instance of `NegatedFilter` that wraps the original filter.
- New method `cacheOnlyWhereUrl()` in the `HttpLoader` class that takes an instance of the `FilterInterface` as argument. If you define one or multiple filters using this method, the loader will cache only responses for URLs that match all the filters.
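For illustration, a sketch of the new `useInputKeyAs...()` methods. The input array keys `link`, `payload` and `requestHeaders` are made-up names for this example:

```php
use Crwlr\Crawler\Steps\Loading\Http;

// Assume a previous step yields outputs like:
// ['link' => 'https://...', 'payload' => '{"foo":"bar"}', 'requestHeaders' => [...]]
$crawler->addStep(
    Http::post()
        ->useInputKeyAsUrl('link')               // URL key isn't named "url"/"uri"
        ->useInputKeyAsBody('payload')           // value becomes the request body
        ->useInputKeyAsHeaders('requestHeaders') // value becomes the request headers
);
```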
- The `HttpCrawl` step (`Http::crawl()`) by default now removes the fragment part of URLs, to not load the same page multiple times, because in almost any case servers won't respond with different content based on the fragment. That's why this change is considered non-breaking. For the rare cases when servers do respond with different content based on the fragment, you can call the new `keepUrlFragment()` method of the step.
- Although the `HttpCrawl` step (`Http::crawl()`) already respected the limit of outputs defined via the `maxOutputs()` method, it actually didn't stop loading pages. The limit had no effect on loading, only on passing on outputs (responses) to the next step. This is fixed in this version.
- A so-called byte order mark at the beginning of a file (/string) can cause issues, so it is now removed when a step's input string starts with a UTF-8 BOM.
- There seems to be an issue in guzzle when it gets a PSR-7 request object with a header with multiple string values (as array, like `['accept-encoding' => ['gzip', 'deflate', 'br']]`). When testing, it happened that it only sent the last part (in this case `br`). Therefore, the `HttpLoader` now prepares headers before sending (in this case to `['accept-encoding' => ['gzip, deflate, br']]`).
- You can now also use the output key aliases when filtering step outputs. You can even use keys that are only present in the serialized version of an output object.
- JSON step: another fix for JSON strings having keys without quotes with an empty string value.
- JSON step: improved the attempt to fix JSON strings having keys without quotes.
- New method `Step::refineOutput()` to manually refine step output values. It takes either a `Closure` or an instance of the new `RefinerInterface` as argument. If the step produces array output, you can provide a key from the array output to refine as first argument and the refiner as second argument. You can call the method multiple times, and all the refiners will be applied to the outputs in the order you add them. If you want to refine multiple output array keys with a `Closure`, you can skip providing a key, and the `Closure` will receive the full output array for refinement. As mentioned, you can provide an instance of the `RefinerInterface`. There are already a few implementations: `StringRefiner::afterFirst()`, `StringRefiner::afterLast()`, `StringRefiner::beforeFirst()`, `StringRefiner::beforeLast()`, `StringRefiner::betweenFirst()`, `StringRefiner::betweenLast()` and `StringRefiner::replace()`.
- New method `Step::excludeFromGroupOutput()` to exclude a normal step's output from the combined output of a group that it's part of.
- New method `HttpLoader::setMaxRedirects()` to customize the limit of redirects to follow. Works only when using the HTTP client.
- New filters to filter by string length, with the same options as the comparison filters (equal, not equal, greater than,...).
- New `Filter::custom()` that you can use with a Closure, so you're not limited to the available filters only.
- New method `DomQuery::link()` as a shortcut for `DomQuery::attribute('href')->toAbsoluteUrl()`.
- New static method `HttpCrawler::make()` returning an instance of the new class `AnonymousHttpCrawlerBuilder`. This makes it possible to create your own Crawler instance with a one-liner like `HttpCrawler::make()->withBotUserAgent('MyCrawler')`. There's also a `withUserAgent()` method to create an instance with a normal (non-bot) user agent.
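A short sketch combining a few of the additions above: `HttpCrawler::make()`, `Step::refineOutput()` with a `StringRefiner`, and `Filter::custom()`. The selectors, output keys and namespaces are assumptions for illustration:

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Refiners\StringRefiner;

// One-liner crawler instance with a bot user agent.
$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->addStep(Http::get())
    ->addStep(
        Html::each('.product')                                      // selector made up
            ->extract(['name' => 'h3', 'price' => '.price'])
            // Refine only the "price" key: keep what comes after the currency sign.
            ->refineOutput('price', StringRefiner::afterFirst('€'))
            // Custom filter via Closure: keep only outputs with a non-empty name.
            ->where('name', Filter::custom(fn ($value) => trim($value) !== ''))
    );
```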
- BREAKING: The `FileCache` now also respects the `ttl` (time to live) argument, and by default it is one hour (3600 seconds). If you're using the cache and expect the items to live (basically) forever, please provide a high enough value for the default time to live. When you try to get a cache item that is already expired, it (the file) is immediately deleted.
- BREAKING: The `TooManyRequestsHandler` (and with that also the constructor argument in the `HttpLoader`) was renamed to `RetryErrorResponseHandler`. It now reacts the same to 503 (Service Unavailable) responses as to 429 (Too Many Requests) responses. If you're actively passing your own instance to the `HttpLoader`, you need to update it.
- You can now have multiple different loaders in a `Crawler`. To use this, return an array containing your loaders from the protected `Crawler::loader()` method, with keys to name them. You can then selectively use them by calling the `Step::useLoader()` method on a loading step with the key of the loader it should use.
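A sketch of the multiple-loaders setup. The `loader()` method signature, the `HttpLoader` constructor arguments and the loader names are assumptions for illustration:

```php
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Steps\Loading\Http;

class MyCrawler extends Crawler
{
    // Return an array of named loaders instead of a single loader instance.
    protected function loader($userAgent, $logger): array
    {
        $browserLoader = new HttpLoader($userAgent, logger: $logger);
        $browserLoader->useHeadlessBrowser();

        return [
            'http' => new HttpLoader($userAgent, logger: $logger),
            'browser' => $browserLoader,
        ];
    }
}

// Pick a loader per step via its key:
$crawler->addStep(Http::get()->useLoader('browser'));
```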
- BREAKING: Removed the loop feature. The only real-world use case should be paginating listings, and that should be solved with the Paginator feature.
- BREAKING: `Step::dontCascade()` and `Step::cascades()` are removed, because with the change in v0.7 that groups can only produce combined output, there should be no use case for them anymore. If you want to exclude one step's output from the combined group output, you can use the new `Step::excludeFromGroupOutput()` method.
- New functionality to paginate: There is the new `Paginate` child class of the `Http` step class (easy access via `Http::get()->paginate()`). It takes an instance of the `PaginatorInterface` and uses it to iterate through pagination links. There is one implementation of that interface, the `SimpleWebsitePaginator`. The `Http::get()->paginate()` method uses it by default, when called just with a CSS selector to get pagination links. Paginators receive all loaded pages and implement the logic to find pagination links. The paginator class is also called before sending a request, with the request object that is about to be sent as an argument (`prepareRequest()`). This way, it should even be doable to implement more complex pagination functionality, for example when pagination is built using POST requests with query strings in the request body.
- New methods `stopOnErrorResponse()` and `yieldErrorResponses()` that can be used with `Http` steps. By calling `stopOnErrorResponse()`, the step will throw a `LoadingException` when a response has a 4xx or 5xx status code. By calling `yieldErrorResponses()`, even error responses will be yielded and passed on to the next steps (this was default behaviour until this version; see the breaking change below).
- The body of HTTP responses with a `Content-Type` header containing `application/x-gzip` is automatically decoded when `Http::getBodyString()` is used. Therefore, added `ext-zlib` to the suggested dependencies in `composer.json`.
- New methods `addToResult()` and `addLaterToResult()`. `addToResult()` is a single replacement for `setResultKey()` and `addKeysToResult()` (they are removed, see "Changed" below) that can be used for array and non-array output. `addLaterToResult()` is a new method that does not create a Result object immediately, but instead adds the output of the current step to all the Results that will later be created originating from the current output.
- New methods `outputKey()` and `keepInputData()` that can be used with any step. Using the `outputKey()` method, the step will convert non-array output to an array and use the key provided as an argument to this method as the array key for the output value. The `keepInputData()` method allows you to forward data from the step's input to the output. If the input is non-array, you can define a key using the method's argument. This is useful e.g. if you have data in the initial inputs that you also want to add to the final crawling results.
- New method `createsResult()` that can be used with any step, so you can differentiate if a step creates a Result object, or just keeps data to add to results later (new `addLaterToResult()` method). But it's primarily relevant for library-internal use.
- The `FileCache` class can now compress the cache data to save disk space. Use the `useCompression()` method to do so.
- New method `retryCachedErrorResponses()` in `HttpLoader`. When called, the loader will only use successful responses (status code < 400) from the cache and therefore retry already cached error responses.
- New method `writeOnlyCache()` in `HttpLoader` to only write to, but not read from, the response cache. Can be used to renew cached responses.
- `Filter::urlPathMatches()` to filter URL paths using a regex.
- Option to provide a chrome executable name to the chrome-php/chrome library via `HttpLoader::setChromeExecutable()`.
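Putting the Paginator feature together with `addToResult()`, a minimal sketch (the CSS selectors and output keys are invented for this example):

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    // Called with just a CSS selector, paginate() uses the SimpleWebsitePaginator.
    ->addStep(Http::get()->paginate('.pagination'))
    ->addStep(
        Html::each('.job-listing')
            ->extract(['title' => 'h3', 'company' => '.company'])
            ->addToResult(['title', 'company'])
    );
```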
- BREAKING: Group steps can now only produce combined outputs, as previously done when the `combineToSingleOutput()` method was called. The method is removed.
- BREAKING: `setResultKey()` and `addKeysToResult()` are removed. Calls to those methods can both be replaced with calls to the new `addToResult()` method.
- BREAKING: `getResultKey()` is also removed along with `setResultKey()`. It's removed without replacement, as it doesn't really make sense any longer.
- BREAKING: Error responses (4xx as well as 5xx), by default, won't produce any step outputs any longer. If you want to receive error responses, use the new `yieldErrorResponses()` method.
- BREAKING: Removed the `httpClient()` method in the `HttpCrawler` class. If you want to provide your own HTTP client, implement a custom `loader` method passing your client to the `HttpLoader` instead.
- Deprecated the loop feature (class `Loop` and the `Crawler::loop()` method). Probably the only use case is iterating over paginated list pages, which can be done using the new Paginator functionality. It will be removed in v1.0.
- In case of a 429 (Too Many Requests) response, the `HttpLoader` now automatically waits and retries. By default, it retries twice and waits 10 seconds for the first retry and a minute for the second one. In case the response also contains a `Retry-After` header with a value in seconds, it complies with that. Exception: by default it waits at max `60` seconds (you can set your own limit if you want); if the `Retry-After` value is higher, it will stop crawling. If all the retries also receive a `429`, it also throws an Exception.
- Removed the logger from the `Throttler`, as it doesn't log anything.
- Fail silently when `robots.txt` can't be parsed.
- Default timeout configuration for the default guzzle HTTP client: `connect_timeout` is `10` seconds and `timeout` is `60` seconds.
- The `validateAndSanitize...()` methods in the abstract `Step` class, when called with an array with one single element, automatically try to use that array element as input value.
- With the `Html` and `Xml` data extraction steps you can now add layers to the data that is being extracted, by just adding further `Html`/`Xml` data extraction steps as values in the mapping array that you pass as argument to the `extract()` method.
- The base `Http` step can now also be called with an array of URLs as a single input. Crawl and Paginate steps still require a single URL input.
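The nested data extraction mentioned above might look like this sketch (all selectors and keys invented for illustration):

```php
use Crwlr\Crawler\Steps\Html;

// Add a layer to the extracted data by using another Html step as a mapping value.
Html::first('article')->extract([
    'headline' => 'h1',
    'comments' => Html::each('.comment')->extract([
        'author' => '.author',
        'text' => '.text',
    ]),
]);
```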
- The `CookieJar` now also works with `localhost` or other hosts without a registered domain name.
- Improved the `Sitemap::getUrlsFromSitemap()` step to also work when the `<urlset>` tag contains attributes that would cause the symfony DomCrawler to not find any elements.
- Fixed the possibility of infinite redirects in the `HttpLoader` by adding a redirects limit of 10.
- New step `Http::crawl()` (class `HttpCrawl`, extending the normal `Http` step class) for conventional crawling. It loads all pages of a website (same host or domain) by following links. There are also a lot of options like depth, filtering by paths, and so on.
- New steps `Sitemap::getSitemapsFromRobotsTxt()` (`GetSitemapsFromRobotsTxt`) and `Sitemap::getUrlsFromSitemap()` (`GetUrlsFromSitemap`) to get sitemap (URLs) from a robots.txt file and to get all the URLs from those sitemaps.
- New step `Html::metaData()` to get data from meta tags (and the title tag) in HTML documents.
- New step `Html::schemaOrg()` (`SchemaOrg`) to get schema.org structured data in JSON-LD format from HTML documents.
- The abstract `DomQuery` class (parent of the `CssSelector` and `XPathQuery` classes) now has some methods to narrow the selected matches further: `first()`, `last()`, `nth(n)`, `even()`, `odd()`.
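For illustration, the new narrowing methods on `DomQuery` inside an `extract()` mapping (selectors invented):

```php
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

Html::first('ul.products')->extract([
    'first' => Dom::cssSelector('li')->first(), // only the first match
    'last'  => Dom::cssSelector('li')->last(),  // only the last match
    'third' => Dom::cssSelector('li')->nth(3),  // only the nth match
    'even'  => Dom::cssSelector('li')->even(),  // matches 2, 4, 6, ...
    'odd'   => Dom::cssSelector('li')->odd(),   // matches 1, 3, 5, ...
]);
```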
- BREAKING: Removed `PoliteHttpLoader` and the traits `WaitPolitely` and `CheckRobotsTxt`. Converted the traits to the classes `Throttler` and `RobotsTxtHandler`, which are dependencies of the `HttpLoader`. The `HttpLoader` internally gets default instances of those classes. The `RobotsTxtHandler` will respect robots.txt rules by default if you use a `BotUserAgent`, and it won't if you use a normal `UserAgent`. You can access the loader's `RobotsTxtHandler` via `HttpLoader::robotsTxt()`. You can pass your own instance of the `Throttler` to the loader and also access it via `HttpLoader::throttle()` to change settings.
- Getting absolute links via the `GetLink` and `GetLinks` steps and the `toAbsoluteUrl()` method of the `CssSelector` and `XPathQuery` classes now also looks for `<base>` tags in the HTML when resolving the URLs.
- The `SimpleCsvFileStore` can now also save results with nested data (but only to the second level). It just concatenates the values, separated with a `|`.
- You can now call the new `useHeadlessBrowser()` method on the `HttpLoader` class to use a headless Chrome browser to load pages. This is enough to get the HTML after executing javascript in the browser. For more sophisticated tasks, a separate Loader and/or Steps should better be created.
- With the `maxOutputs()` method of the abstract `Step` class you can now limit how many outputs a certain step should yield at max. That's for example helpful during development, when you want to run the crawler only with a small subset of the data/requests it will actually have to process when you eventually remove the limits. When a step has reached its limit, it won't even call the `invoke()` method any longer, until the step is reset after a run.
- With the new `outputHook()` method of the abstract `Crawler` class you can set a closure that'll receive all the outputs from all the steps. It should be used only for debugging reasons.
- The `extract()` method of the `Html` and `Xml` (children of `Dom`) steps now also works with a single selector instead of an array with a mapping. Sometimes you'll want to just get a simple string output, e.g. for a next step, instead of an array with mapped extracted data.
- In addition to `uniqueOutputs()` there is now also `uniqueInputs()`. It works exactly the same as `uniqueOutputs()`, filtering duplicate input values instead. Optionally also by a key, when the expected input is an array or an object.
- In order to be able to also get absolute links when using the `extract()` method of Dom steps, the abstract `DomQuery` class now has a method `toAbsoluteUrl()`. The Dom step will automatically provide the `DomQuery` instance with the base URL, presumed that the input was an instance of the `RespondedRequest` class, and resolve the selected value against that base URL.
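A sketch of `toAbsoluteUrl()` inside an `extract()` mapping (selectors invented for illustration):

```php
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

// When the step's input is a RespondedRequest, the Dom step hands the base URL
// to the DomQuery, so relative hrefs are resolved to absolute URLs.
Html::each('.teaser')->extract([
    'title' => 'h2',
    'link' => Dom::cssSelector('a')->attribute('href')->toAbsoluteUrl(),
]);
```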
- Removed some not-so-important log messages.
- Improved the behavior of the group step's `combineToSingleOutput()`. When steps yield multiple outputs, don't combine all yielded outputs into one. Instead, combine the first output from the first step with the first output from the second step, and so on.
- When results are not explicitly composed, but the outputs of the last step are arrays with string keys, it sets those keys on the Result object instead of setting a key `unnamed` with the whole array as value.
- The static methods `Html::getLink()` and `Html::getLinks()` now also work without an argument, like the `GetLink` and `GetLinks` classes.
- When a `DomQuery` (CSS selector or XPath query) doesn't match anything, its `apply()` method now returns `null` (instead of an empty string). When the `Html(/Xml)::extract()` method is used with a single, non-matching selector/query, nothing is yielded. When it's used with an array with a mapping, it yields an array with null values. If the selector for one of the methods `Html(/Xml)::each()`, `Html(/Xml)::first()` or `Html(/Xml)::last()` doesn't match anything, that no longer causes an error; it just won't yield anything.
- Removed the (unnecessary) second argument from the `Loop::withInput()` method, because when `keepLoopingWithoutOutput()` is called and `withInput()` is called after that call, it resets the behavior.
- Fixed an issue when the date format for the expires date in a cookie doesn't have dashes in `d-M-Y` (so `d M Y`).
- The `Json` step now also works with HTTP responses as input.
- The `BaseStep` class now has `where()` and `orWhere()` methods to filter step outputs. You can set multiple filters that will be applied to all outputs. When setting a filter using `orWhere()`, it's linked to the previously added filter with "OR". Outputs not matching one of the filters are not yielded. The available filters can be accessed through static methods on the new `Filter` class. Currently available are comparison filters (equal, greater/less than,...), a few string filters (contains, starts/ends with) and URL filters (scheme, domain, host,...).
- The `GetLink` and `GetLinks` steps now have the methods `onSameDomain()`, `notOnSameDomain()`, `onDomain()`, `onSameHost()`, `notOnSameHost()` and `onHost()` to restrict which links to find.
- Automatically add the crawler's logger to the `Store`, so you can also log messages from there. This can be breaking, as the `StoreInterface` now also requires the `addLogger` method. The new abstract `Store` class already implements it, so you can just extend it.
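A sketch of output filtering with `where()`/`orWhere()`. The exact static filter method names are assumptions based on the filter groups listed above:

```php
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

// Keep only links on a certain host, or any link whose URL contains "/blog/".
Html::getLinks()
    ->where(Filter::urlHost('www.example.com'))
    ->orWhere(Filter::stringContains('/blog/'));
```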
- The `Csv` step can now also be used without defining a column mapping. In that case it will use the values from the first line (so this makes sense when there are column headlines) as output array keys.
- By calling `monitorMemoryUsage()` you can tell the Crawler to add log messages with the current memory usage after every step invocation. You can also set a limit in bytes when to start monitoring; below the limit it won't log memory usage.
- Previously, the use of Generators actually didn't make a lot of sense, because the outputs of one step were only iterated and passed on to the next step after the current step was invoked with all its inputs. That made steps with a lot of inputs bottlenecks and caused bigger memory consumption. So, the crawler was changed to immediately pass on outputs of one step to the next step, if there is one.
- `uniqueOutputs()` method to Steps to get only unique output values. If outputs are arrays or objects, you can provide a key that will be used as identifier to check for uniqueness. Otherwise, the arrays or objects will be serialized for comparison, which will probably be slower.
- `runAndTraverse()` method to Crawler, so you don't need to manually traverse the Generator if you don't need the results where you're calling the crawler.
- Implemented the behaviour for when a `Group` step should add something to the Result using `setResultKey()` or `addKeysToResult()`, which was still missing. For groups this will only work when using `combineToSingleOutput()`.