This repository was archived by the owner on Aug 14, 2021. It is now read-only.
CHANGELOG.md: 9 additions & 1 deletion
@@ -3,9 +3,17 @@ All notable changes to this project will be documented in this file.
## Unreleased

+ - Merged PR#49 (Missing object when calling `->getContent()`)
+ - Imported all changes from Readability.js as of 2 March 2018 ([8525c6a](https://github.com/mozilla/readability/commit/8525c6af36d3badbe27c4672a6f2dd99ddb4097f)):
+   - Check for `<base>` elements before converting URLs to absolute.
+   - Clean `<link>` tags on `prepArticle()`
+   - Attempt to return at least some text if all the algorithm runs fail (Check PR [#423](https://github.com/mozilla/readability/pull/423) on JS version)
+   - Add new test cases for the previous changes
+   - And all other changes reflected [in this diff](https://github.com/mozilla/readability/compare/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5...8525c6af36d3badbe27c4672a6f2dd99ddb4097f)
PHP port of *Mozilla's* **[Readability.js](https://github.com/mozilla/readability)**. Parses HTML text (usually news and other articles) and returns **title**, **author**, **main image** and **text content** without nav bars, ads, footers, or anything that isn't the main body of the text. Analyzes each node, gives each a score, and determines what's relevant and what can be discarded.
The project aims to be a 1:1 port of Mozilla's version and to follow closely all changes introduced there, but there are some major differences in structure. Most of the code is a 1:1 copy (even the comments were imported), but some functions and structures were adapted to better suit PHP.
+ **Lead Developer**: Andres Rey
+
## Requirements
PHP 5.6+, ext-dom, ext-xml, and ext-mbstring. To install all these dependencies (in the rare case your system does not have them already), you could try something like this in *nix-like environments:
First you have to require the library using composer:
@@ -152,7 +152,7 @@ Self closing tags like `<br />` get automatically expanded to `<br></br>`. No way
## Dependencies
- Readability.php uses the [PSR Log](https://github.com/php-fig/log) interface to define the allowed type of loggers.
+ Readability.php uses the [PSR Log](https://github.com/php-fig/log) interface to define the allowed type of loggers. [Monolog](https://github.com/Seldaek/monolog) is only required on development installations (`--dev` option during `composer install`).
## To-do
@@ -165,7 +165,7 @@ Readability parses all the text with DOMDocument, scans the text nodes and gives
## Code porting
- Up to date with readability.js as of [16 Oct 2017](https://github.com/mozilla/readability/commit/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5).
+ Up to date with readability.js as of [2 Mar 2018](https://github.com/mozilla/readability/commit/8525c6af36d3badbe27c4672a6f2dd99ddb4097f).
$this->logger->info(sprintf('[Parsing] Article parsed. Amount of words: %s. Current threshold is: %s', $length, $this->configuration->getWordThreshold()));

- if ($result && mb_strlen(preg_replace('/\s/', '', $result->textContent)) < $this->configuration->getWordThreshold()) {
+ $parseSuccessful = true;
+
+ if ($result && $length < $this->configuration->getWordThreshold()) {
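The removed line measured the article by counting the characters left after stripping all whitespace, then compared that count to the word threshold. The sketch below isolates that check in plain PHP. It assumes that `$length` in the new line holds the same whitespace-stripped count (the diff does not show where it is assigned), and the helper names `nonWhitespaceLength` and `isBelowThreshold` are hypothetical, not part of Readability.php.

```php
<?php
// Hypothetical helper (not part of Readability.php): number of characters
// left after removing every whitespace character, as in the removed line.
// Requires ext-mbstring, which the README already lists as a requirement.
function nonWhitespaceLength($text)
{
    return mb_strlen(preg_replace('/\s/', '', $text));
}

// Hypothetical helper: true when some text was extracted but its
// whitespace-stripped length falls below the configured threshold.
function isBelowThreshold($text, $wordThreshold)
{
    return $text !== null && nonWhitespaceLength($text) < $wordThreshold;
}

// Two spaces and a newline are dropped, leaving "Helloworld".
$length = nonWhitespaceLength("Hello  world\n");
```

Computing the length once and reusing it, as the added lines appear to do, keeps the logged value and the value used in the comparison identical.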