This repository was archived by the owner on Aug 14, 2021. It is now read-only.

Commit 45c5826

Merge pull request #54 from andreskrey/development
Prepare for release
2 parents f0f6906 + a7b5fa2 commit 45c5826

93 files changed: +3588 -1447 lines changed

.coveralls.yml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+coverage_clover: test/clover.xml
+json_path: test/coveralls-upload.json
+service_name: travis-ci

.travis.yml

Lines changed: 9 additions & 1 deletion
@@ -1,11 +1,19 @@
 language: php

-install: composer install
+install:
+  - composer install

 php:
   - "5.6"
   - "7.0"
   - "7.1"
   - "7.2"

+script:
+  - ./vendor/bin/phpunit --coverage-clover ./test/clover.xml
+
+after_script:
+  - composer require php-coveralls/php-coveralls:^2.0
+  - php ./vendor/php-coveralls/php-coveralls/bin/php-coveralls -v
+
 sudo: false

CHANGELOG.md

Lines changed: 9 additions & 1 deletion
@@ -3,9 +3,17 @@ All notable changes to this project will be documented in this file.

 ## Unreleased

+- Merged PR#49 (Missing object when calling `->getContent()`)
+- Imported all changes from Readability.js as of 2 March 2018 ([8525c6a](https://github.com/mozilla/readability/commit/8525c6af36d3badbe27c4672a6f2dd99ddb4097f)):
+    - Check for `<base>` elements before converting URLs to absolute.
+    - Clean `<link>` tags on `prepArticle()`
+    - Attempt to return at least some text if all the algorithm runs fail (Check PR [#423](https://github.com/mozilla/readability/pull/423) on JS version)
+    - Add new test cases for the previous changes
+    - And all other changes reflected [in this diff](https://github.com/mozilla/readability/compare/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5...8525c6af36d3badbe27c4672a6f2dd99ddb4097f)
+
 ## [v1.1.1](https://github.com/andreskrey/readability.php/releases/tag/v1.1.1)

-- Switched from assertEquals to assertSame on unit testing to avoid weak comparisons.
+- Switched from assertEquals to assertSame on unit testing to avoid weak comparisons.
 - Added a safe check to avoid sending the DOMDocument as a node when scanning for node ancestors.
 - Fix issue #45: Small mistake in documentation
 - Fix issue #46: Added `data-src` as a image source path

README.md

Lines changed: 5 additions & 5 deletions
@@ -1,20 +1,20 @@
 # Readability.php
-[![Latest Stable Version](https://poser.pugx.org/andreskrey/readability.php/v/stable)](https://packagist.org/packages/andreskrey/readability.php) [![StyleCI](https://styleci.io/repos/71042668/shield?branch=master)](https://styleci.io/repos/71042668) [![Build Status](https://travis-ci.org/andreskrey/readability.php.svg?branch=master)](https://travis-ci.org/andreskrey/readability.php) [![Total Downloads](https://poser.pugx.org/andreskrey/readability.php/downloads)](https://packagist.org/packages/andreskrey/readability.php) [![Monthly Downloads](https://poser.pugx.org/andreskrey/readability.php/d/monthly)](https://packagist.org/packages/andreskrey/readability.php)
+[![Latest Stable Version](https://poser.pugx.org/andreskrey/readability.php/v/stable)](https://packagist.org/packages/andreskrey/readability.php) [![Build Status](https://travis-ci.org/andreskrey/readability.php.svg?branch=master)](https://travis-ci.org/andreskrey/readability.php) [![Coverage Status](https://coveralls.io/repos/github/andreskrey/readability.php/badge.svg?branch=master)](https://coveralls.io/github/andreskrey/readability.php/?branch=master) [![StyleCI](https://styleci.io/repos/71042668/shield?branch=master)](https://styleci.io/repos/71042668) [![Total Downloads](https://poser.pugx.org/andreskrey/readability.php/downloads)](https://packagist.org/packages/andreskrey/readability.php) [![Monthly Downloads](https://poser.pugx.org/andreskrey/readability.php/d/monthly)](https://packagist.org/packages/andreskrey/readability.php)

 PHP port of *Mozilla's* **[Readability.js](https://github.com/mozilla/readability)**. Parses html text (usually news and other articles) and returns **title**, **author**, **main image** and **text content** without nav bars, ads, footers, or anything that isn't the main body of the text. Analyzes each node, gives them a score, and determines what's relevant and what can be discarded.

 ![Screenshot](https://raw.githubusercontent.com/andreskrey/readability.php/assets/screenshot.png)

 The project aim is to be a 1 to 1 port of Mozilla's version and to follow closely all changes introduced there, but there are some major differences on the structure. Most of the code is a 1:1 copy –even the comments were imported– but some functions and structures were adapted to suit better the PHP language.

+**Lead Developer**: Andres Rey
+
 ## Requirements

 PHP 5.6+, ext-dom, ext-xml, and ext-mbstring. To install all this dependencies (in the rare case your system does not have them already), you could try something like this in *nix like environments:

 `$ sudo apt-get install php7.1-xml php7.1-mbstring`

-**Lead Developer**: Andres Rey
-
 ## How to use it

 First you have to require the library using composer:
@@ -152,7 +152,7 @@ Self closing tags like `<br />` get automatically expanded to `<br></br`. No way

 ## Dependencies

-Readability.php uses the [PSR Log](https://github.com/php-fig/log) interface to define the allowed type of loggers.
+Readability.php uses the [PSR Log](https://github.com/php-fig/log) interface to define the allowed type of loggers. [Monolog](https://github.com/Seldaek/monolog) is only required on development installations. (`--dev` option during `composer install`).

 ## To-do

@@ -165,7 +165,7 @@ Readability parses all the text with DOMDocument, scans the text nodes and gives

 ## Code porting

-Up to date with readability.js as of [16 Oct 2017](https://github.com/mozilla/readability/commit/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5).
+Up to date with readability.js as of [2 Mar 2018](https://github.com/mozilla/readability/commit/8525c6af36d3badbe27c4672a6f2dd99ddb4097f).

 ## License


composer.json

Lines changed: 2 additions & 1 deletion
@@ -25,7 +25,8 @@
         "psr/log": "^1.0"
     },
     "require-dev": {
-        "phpunit/phpunit": "^5.7"
+        "phpunit/phpunit": "^5.7",
+        "monolog/monolog": "^1.23"
     },
     "suggest": {
         "monolog/monolog": "Allow logging debug information"

src/Configuration.php

Lines changed: 12 additions & 0 deletions
@@ -114,6 +114,18 @@ public function getLogger()
         }
     }

+    /**
+     * @param LoggerInterface $logger
+     *
+     * @return Configuration
+     */
+    public function setLogger(LoggerInterface $logger)
+    {
+        $this->logger = $logger;
+
+        return $this;
+    }
+
     /**
      * @return int
      */
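
The new `setLogger()` returns the configuration object, so calls can be chained. The following is a minimal sketch of that fluent pattern, not the library's code: `SketchLoggerInterface`, `ArrayLogger`, `SketchConfiguration`, `setWordThreshold()` and the default threshold of 500 are hypothetical stand-ins, used so the example runs without Composer or `psr/log`.

```php
<?php

// Hypothetical stand-in for Psr\Log\LoggerInterface (assumption: reduced to one method).
interface SketchLoggerInterface
{
    public function debug($message);
}

// Hypothetical logger that keeps messages in memory.
class ArrayLogger implements SketchLoggerInterface
{
    public $messages = [];

    public function debug($message)
    {
        $this->messages[] = $message;
    }
}

// Hypothetical, reduced configuration class with fluent setters.
class SketchConfiguration
{
    private $logger;
    private $wordThreshold = 500; // illustrative default, not taken from the library

    public function setLogger(SketchLoggerInterface $logger)
    {
        $this->logger = $logger;

        return $this; // returning $this is what makes chaining possible
    }

    public function setWordThreshold($wordThreshold)
    {
        $this->wordThreshold = $wordThreshold;

        return $this;
    }

    public function getLogger()
    {
        return $this->logger;
    }

    public function getWordThreshold()
    {
        return $this->wordThreshold;
    }
}

$logger = new ArrayLogger();
$config = (new SketchConfiguration())
    ->setLogger($logger)
    ->setWordThreshold(200);
```

In the real library the argument would be any PSR-3 logger, such as a Monolog instance.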

src/Nodes/DOM/DOMDocument.php

Lines changed: 2 additions & 1 deletion
@@ -20,10 +20,11 @@ public function __construct($version, $encoding)
         $this->registerNodeClass('DOMDocumentFragment', DOMDocumentFragment::class);
         $this->registerNodeClass('DOMDocumentType', DOMDocumentType::class);
         $this->registerNodeClass('DOMElement', DOMElement::class);
+        $this->registerNodeClass('DOMEntity', DOMEntity::class);
+        $this->registerNodeClass('DOMEntityReference', DOMEntityReference::class);
         $this->registerNodeClass('DOMNode', DOMNode::class);
         $this->registerNodeClass('DOMNotation', DOMNotation::class);
         $this->registerNodeClass('DOMProcessingInstruction', DOMProcessingInstruction::class);
         $this->registerNodeClass('DOMText', DOMText::class);
-        $this->registerNodeClass('DOMEntityReference', DOMEntityReference::class);
     }
 }

src/Nodes/DOM/DOMEntity.php

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+<?php
+
+namespace andreskrey\Readability\Nodes\DOM;
+
+use andreskrey\Readability\Nodes\NodeTrait;
+
+class DOMEntity extends \DOMEntity
+{
+    use NodeTrait;
+}

src/Nodes/NodeTrait.php

Lines changed: 3 additions & 0 deletions
@@ -7,6 +7,9 @@
 use andreskrey\Readability\Nodes\DOM\DOMNode;
 use andreskrey\Readability\Nodes\DOM\DOMText;

+/**
+ * @method \DOMNode removeAttribute($name)
+ */
 trait NodeTrait
 {
     /**

src/Readability.php

Lines changed: 93 additions & 23 deletions
@@ -77,6 +77,13 @@ class Readability
      */
     private $logger;

+    /**
+     * Collection of attempted text extractions.
+     *
+     * @var array
+     */
+    private $attempts = [];
+
     /**
      * @var array
      */
@@ -155,54 +162,76 @@ public function parse($html)
              * finding the -right- content.
              */

-            $length = 0;
-            foreach ($result->getElementsByTagName('p') as $p) {
-                $length += mb_strlen($p->textContent);
-            }
+            $length = mb_strlen(preg_replace(NodeUtility::$regexps['onlyWhitespace'], '', $result->textContent));

             $this->logger->info(sprintf('[Parsing] Article parsed. Amount of words: %s. Current threshold is: %s', $length, $this->configuration->getWordThreshold()));

-            if ($result && mb_strlen(preg_replace('/\s/', '', $result->textContent)) < $this->configuration->getWordThreshold()) {
+            $parseSuccessful = true;
+
+            if ($result && $length < $this->configuration->getWordThreshold()) {
                 $this->dom = $this->loadHTML($html);
                 $root = $this->dom->getElementsByTagName('body')->item(0);
+                $parseSuccessful = false;

                 if ($this->configuration->getStripUnlikelyCandidates()) {
                     $this->logger->debug('[Parsing] Threshold not met, trying again setting StripUnlikelyCandidates as false');
                     $this->configuration->setStripUnlikelyCandidates(false);
+                    $this->attempts[] = ['articleContent' => $result, 'textLength' => $length];
                 } elseif ($this->configuration->getWeightClasses()) {
                     $this->logger->debug('[Parsing] Threshold not met, trying again setting WeightClasses as false');
                     $this->configuration->setWeightClasses(false);
+                    $this->attempts[] = ['articleContent' => $result, 'textLength' => $length];
                 } elseif ($this->configuration->getCleanConditionally()) {
                     $this->logger->debug('[Parsing] Threshold not met, trying again setting CleanConditionally as false');
                     $this->configuration->setCleanConditionally(false);
+                    $this->attempts[] = ['articleContent' => $result, 'textLength' => $length];
                 } else {
-                    $this->logger->emergency('[Parsing] Could not parse text, giving up :(');
+                    $this->logger->debug('[Parsing] Threshold not met, searching across attempts for some content.');
+                    $this->attempts[] = ['articleContent' => $result, 'textLength' => $length];
+
+                    // No luck after removing flags, just return the longest text we found during the different loops
+                    usort($this->attempts, function ($a, $b) {
+                        return $a['textLength'] < $b['textLength'];
+                    });
+
+                    // But first check if we actually have something
+                    if (!$this->attempts[0]['textLength']) {
+                        $this->logger->emergency('[Parsing] Could not parse text, giving up :(');

-                    throw new ParseException('Could not parse text.');
+                        throw new ParseException('Could not parse text.');
+                    }
+
+                    $this->logger->debug('[Parsing] Threshold not met, but found some content in previous attempts.');
+
+                    $result = $this->attempts[0]['articleContent'];
+                    $parseSuccessful = true;
+                    break;
                 }
             } else {
                 break;
             }
         }

-        $result = $this->postProcessContent($result);
-
-        // If we haven't found an excerpt in the article's metadata, use the article's
-        // first paragraph as the excerpt. This can be used for displaying a preview of
-        // the article's content.
-        if (!$this->getExcerpt()) {
-            $this->logger->debug('[Parsing] No excerpt text found on metadata, extracting first p node and using it as excerpt.');
-            $paragraphs = $result->getElementsByTagName('p');
-            if ($paragraphs->length > 0) {
-                $this->setExcerpt(trim($paragraphs->item(0)->textContent));
+        if ($parseSuccessful) {
+            $result = $this->postProcessContent($result);
+
+            // If we haven't found an excerpt in the article's metadata, use the article's
+            // first paragraph as the excerpt. This can be used for displaying a preview of
+            // the article's content.
+            if (!$this->getExcerpt()) {
+                $this->logger->debug('[Parsing] No excerpt text found on metadata, extracting first p node and using it as excerpt.');
+                $paragraphs = $result->getElementsByTagName('p');
+                if ($paragraphs->length > 0) {
+                    $this->setExcerpt(trim($paragraphs->item(0)->textContent));
+                }
             }
-        }

-        $this->setContent($result);
+            $this->setContent($result);

-        $this->logger->info('*** Parse successful :)');
+            $this->logger->info('*** Parse successful :)');

-        return true;
+            return true;
+        }
     }

     /**
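
The `parse()` hunk above records every attempt that falls below the threshold and, once all flags are exhausted, falls back to the longest one. The following standalone sketch shows that selection step under some assumptions: `textLength()` and `pickLongestAttempt()` are hypothetical helpers, article content is a plain string instead of a DOM document, the pattern `'/\s/'` stands in for `NodeUtility::$regexps['onlyWhitespace']` (whose value is not shown in this diff), and `null` is returned where the library throws `ParseException`. The comparator returns an integer, which is the form the PHP manual documents for `usort()`.

```php
<?php

/**
 * Count non-whitespace characters in a string.
 * Falls back to strlen() when ext-mbstring is not available.
 */
function textLength($text)
{
    $stripped = preg_replace('/\s/', '', $text);

    return function_exists('mb_strlen') ? mb_strlen($stripped) : strlen($stripped);
}

/**
 * Return the content of the longest attempt, or null when no attempt has text.
 *
 * @param array $attempts list of ['articleContent' => string, 'textLength' => int]
 */
function pickLongestAttempt(array $attempts)
{
    // Sort descending by text length; the callback returns an integer.
    usort($attempts, function ($a, $b) {
        return $b['textLength'] - $a['textLength'];
    });

    if (empty($attempts) || !$attempts[0]['textLength']) {
        return null; // the library throws ParseException at this point
    }

    return $attempts[0]['articleContent'];
}

// Record three hypothetical extraction attempts, as the retry loop would.
$attempts = [];
foreach (['short text', 'a somewhat longer piece of text', '   '] as $candidate) {
    $attempts[] = ['articleContent' => $candidate, 'textLength' => textLength($candidate)];
}

$best = pickLongestAttempt($attempts);
```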
@@ -468,6 +497,10 @@ private function getArticleTitle()
                 if (count(preg_split('/\s+/', $curTitle)) < 3) {
                     $curTitle = substr($originalTitle, strpos($originalTitle, ':') + 1);
                     $this->logger->info(sprintf('[Metadata] Title too short, using the first part of the title instead: \'%s\'', $curTitle));
+                } elseif (count(preg_split('/\s+/', substr($curTitle, 0, strpos($curTitle, ':')))) > 5) {
+                    // But if we have too many words before the colon there's something weird
+                    // with the titles and the H tags so let's just use the original title instead
+                    $curTitle = $originalTitle;
                 }
             }
         } elseif (mb_strlen($curTitle) > 150 || mb_strlen($curTitle) < 15) {
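
The added branch in `getArticleTitle()` counts the words before the colon and reverts to the original title when there are more than five. Here is a small sketch of that check in isolation. `wordsBeforeColon()` and `chooseTitle()` are hypothetical helpers, and the explicit handling of titles without a colon is an assumption added for the example rather than something the diff shows.

```php
<?php

/**
 * Count whitespace-separated words that precede the first colon.
 * Returns 0 when the title contains no colon or nothing precedes it.
 */
function wordsBeforeColon($title)
{
    $colon = strpos($title, ':');
    if ($colon === false) {
        return 0;
    }

    $prefix = trim(substr($title, 0, $colon));
    if ($prefix === '') {
        return 0;
    }

    return count(preg_split('/\s+/', $prefix));
}

/**
 * Keep the candidate title unless more than five words come before its colon,
 * in which case fall back to the original title.
 */
function chooseTitle($curTitle, $originalTitle)
{
    return wordsBeforeColon($curTitle) > 5 ? $originalTitle : $curTitle;
}

$kept = chooseTitle('News: a short story', 'Site | News: a short story');
$reverted = chooseTitle('One two three four five six: rest', 'Original title');
```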
@@ -549,7 +582,19 @@ private function toAbsoluteURI($uri)
      */
     public function getPathInfo($url)
     {
-        $pathBase = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST) . dirname(parse_url($url, PHP_URL_PATH)) . '/';
+        // Check for base URLs
+        if ($this->dom->baseURI !== null) {
+            if (substr($this->dom->baseURI, 0, 1) === '/') {
+                // URLs starting with '/' override completely the URL defined in the link
+                $pathBase = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST) . $this->dom->baseURI;
+            } else {
+                // Otherwise just prepend the base to the actual path
+                $pathBase = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST) . dirname(parse_url($url, PHP_URL_PATH)) . '/' . rtrim($this->dom->baseURI, '/') . '/';
+            }
+        } else {
+            $pathBase = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST) . dirname(parse_url($url, PHP_URL_PATH)) . '/';
+        }
+
         $scheme = parse_url($pathBase, PHP_URL_SCHEME);
         $prePath = $scheme . '://' . parse_url($pathBase, PHP_URL_HOST);
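
The three branches in the `getPathInfo()` hunk above can be read as one small function from a page URL and an optional `<base>` value to a base path. The sketch below restates them as a standalone function. `resolvePathBase()` is a hypothetical name, the base value is passed in as a plain argument instead of being read from `$this->dom->baseURI`, and the example URLs are made up.

```php
<?php

/**
 * Build the base path used to resolve relative links.
 *
 * @param string      $url     URL of the page being parsed
 * @param string|null $baseURI value taken from a <base> element, or null when absent
 *
 * @return string
 */
function resolvePathBase($url, $baseURI = null)
{
    $origin = parse_url($url, PHP_URL_SCHEME) . '://' . parse_url($url, PHP_URL_HOST);
    $directory = dirname(parse_url($url, PHP_URL_PATH)) . '/';

    if ($baseURI === null) {
        // No <base>: resolve against the directory of the page itself.
        return $origin . $directory;
    }

    if (substr($baseURI, 0, 1) === '/') {
        // A base starting with '/' replaces the page's own directory.
        return $origin . $baseURI;
    }

    // Any other base is appended to the page's directory.
    return $origin . $directory . rtrim($baseURI, '/') . '/';
}

$page = 'http://example.com/articles/2018/post.html';

$plain = resolvePathBase($page);
$rooted = resolvePathBase($page, '/static/');
$nested = resolvePathBase($page, 'img/');
```

Note that this mirrors the diff's behaviour only for path-style base values; a `<base>` holding a full absolute URL is not handled by either the sketch or the branches shown.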

@@ -1129,6 +1174,7 @@ public function prepArticle(DOMDocument $article)
         $this->_clean($article, 'embed');
         $this->_clean($article, 'h1');
         $this->_clean($article, 'footer');
+        $this->_clean($article, 'link');

         // Clean out elements have "share" in their id/class combinations from final top candidates,
         // which means we don't remove the top candidates even they have "share".
@@ -1479,6 +1525,28 @@ public function _cleanHeaders(DOMDocument $article)
         }
     }

+    /**
+     * Removes the class="" attribute from every element in the given
+     * subtree.
+     *
+     * Readability.js has a special filter to avoid cleaning the classes that the algorithm adds. We don't add classes
+     * here so no need to filter those.
+     *
+     * @param DOMDocument|DOMNode $node
+     *
+     * @return void
+     **/
+    public function _cleanClasses($node)
+    {
+        if ($node->getAttribute('class') !== '') {
+            $node->removeAttribute('class');
+        }
+
+        for ($node = $node->firstChild; $node !== null; $node = $node->nextSibling) {
+            $this->_cleanClasses($node);
+        }
+    }
+
     /**
      * @param DOMDocument $article
      *
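
`_cleanClasses()` above walks the subtree depth-first through `firstChild` and `nextSibling` and drops the `class` attribute along the way. The following sketch performs the same traversal with PHP's built-in DOM classes rather than the library's extended node classes. The free function `cleanClasses()` and the `instanceof DOMElement` guard are assumptions made for this example, because the native `DOMDocument` and `DOMText` classes have no `getAttribute()` method.

```php
<?php

/**
 * Recursively strip the class attribute from $node and all of its descendants.
 * Only element nodes carry attributes, so every other node type is skipped.
 */
function cleanClasses(DOMNode $node)
{
    if ($node instanceof DOMElement && $node->hasAttribute('class')) {
        $node->removeAttribute('class');
    }

    // Visit each child in document order and recurse into it.
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        cleanClasses($child);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<div class="outer"><p class="lead" id="intro">Hello <span class="x">world</span></p></div>');

cleanClasses($dom);

$html = $dom->saveHTML();
```

Other attributes, such as the `id` in the sample markup, are left untouched.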
@@ -1532,6 +1600,8 @@ public function postProcessContent(DOMDocument $article)
             }
         }

+        $this->_cleanClasses($article);
+
         return $article;
     }

@@ -1564,7 +1634,7 @@ protected function setTitle($title)
      */
     public function getContent()
     {
-        return $this->content->C14N();
+        return ($this->content instanceof DOMDocument) ? $this->content->C14N() : null;
     }

     /**
