This repository was archived by the owner on Aug 14, 2021. It is now read-only.

Commit 0d02e29

Merge pull request #33 from andreskrey/v1.0
v1.0
2 parents bbf6068 + 62deed5 commit 0d02e29

110 files changed: +2786 -2519 lines changed

.travis.yml

Lines changed: 1 addition & 2 deletions
@@ -3,10 +3,9 @@ language: php
  install: composer install

  php:
- - "5.4"
- - "5.5"
  - "5.6"
  - "7.0"
  - "7.1"
+ - "7.2"

  sudo: false

AUTHORS.md

Lines changed: 4 additions & 1 deletion
@@ -1,6 +1,9 @@
  # Authors

- Readability.php developed by **Andres Rey**. Copyright (c) 2010 Arc90 Inc
+ Readability.php developed by **Andres Rey**.
+
+ Based on Arc90's readability.js (1.7.1) script available at: http://code.google.com/p/arc90labs-readability.
+ Copyright (c) 2010 Arc90 Inc

  The AUTHORS/Contributors are (and/or have been):

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -3,6 +3,15 @@ All notable changes to this project will be documented in this file.

  ## Unreleased

+ ## [v1.0.0](https://github.com/andreskrey/readability.php/releases/tag/v1.0.0)
+
+ - Node encapsulation is gone. Pre v1 all nodes were encapsulated in a Readability class, which created lots of trouble with dependencies, responsibilities, and properties. Now all the encapsulation is gone: all the DOMNodes inside the Readability class are extensions of the original DOM classes, which allows the system to take advantage of the functions and properties of DOMDocument.
+ - HTMLParser is gone, Readability is the new main class. Switched things a bit for this release. Pre v1 you had to create an HTMLParser class to parse the HTML. Now you have to create a Readability class, feed it the text, and check the result.
+ - No more dumb arrays as a result. If you want to get the title, content, images, or anything else you'll have to use the getters of the Readability class.
+ - Environment class is gone. Now you have to create a configuration class and use setters to set your configuration options.
+ - Exceptions. Make sure you wrap your Readability class in a try catch block, because if it fails to parse your HTML, it will throw a `ParseException`.
+ - Minimum PHP version bumped to 5.6.
+
  ## [v0.3.1](https://github.com/andreskrey/readability.php/releases/tag/v0.3.1)

  - Trim titles when detecting hierarchical separators to avoid false negatives on strings with spaces.
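
Note on the first v1.0.0 entry: it says the nodes are now extensions of the original DOM classes. PHP's `DOMDocument::registerNodeClass()` is one standard way to get that behaviour, and the sketch below illustrates the idea only. The `ScoredElement` class, its methods, and the `data-score` attribute are hypothetical names chosen for this example; nothing here is confirmed to match the library's own classes.

```php
<?php

// Hypothetical element class: the class name and the data-score attribute
// are illustrative assumptions, not taken from readability.php.
class ScoredElement extends DOMElement
{
    public function setScore($score)
    {
        // Keep the score in an attribute so it lives in the DOM tree itself.
        $this->setAttribute('data-score', (string) $score);
    }

    public function getScore()
    {
        // A missing attribute yields an empty string, which casts to 0.0.
        return (float) $this->getAttribute('data-score');
    }
}

$dom = new DOMDocument();
// Ask DOMDocument to hand out ScoredElement wherever it would create a DOMElement.
$dom->registerNodeClass('DOMElement', 'ScoredElement');
$dom->loadHTML('<html><body><p>First paragraph</p><p>Second paragraph</p></body></html>');

$dom->getElementsByTagName('p')->item(0)->setScore(12.5);

// A fresh lookup reaches the same underlying node, so the score is still there.
$again = $dom->getElementsByTagName('p')->item(0);
echo get_class($again), ' ', $again->getScore(), PHP_EOL;
```

Storing the value in an attribute rather than in a PHP property is a deliberate choice in this sketch: the value then travels with the node no matter which variable or lookup reaches it.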

README.md

Lines changed: 64 additions & 39 deletions
@@ -1,66 +1,93 @@
  # Readability.php
  [![Latest Stable Version](https://poser.pugx.org/andreskrey/readability.php/v/stable)](https://packagist.org/packages/andreskrey/readability.php) [![StyleCI](https://styleci.io/repos/71042668/shield?branch=master)](https://styleci.io/repos/71042668) [![Build Status](https://travis-ci.org/andreskrey/readability.php.svg?branch=master)](https://travis-ci.org/andreskrey/readability.php) [![Total Downloads](https://poser.pugx.org/andreskrey/readability.php/downloads)](https://packagist.org/packages/andreskrey/readability.php) [![Monthly Downloads](https://poser.pugx.org/andreskrey/readability.php/d/monthly)](https://packagist.org/packages/andreskrey/readability.php)

- PHP port of *Mozilla's* **[Readability.js](https://github.com/mozilla/readability)**. Parses html text (usually news and other articles) and returns title, byline and text content without nav bars, ads, footers, or anything that isn't the main body of the text. Analizes each text node, gives an score and orders them based on this calculation.
+ PHP port of *Mozilla's* **[Readability.js](https://github.com/mozilla/readability)**. Parses html text (usually news and other articles) and returns **title**, **author**, **main image** and **text content** without nav bars, ads, footers, or anything that isn't the main body of the text. Analyzes each node, gives them a score, and determines what's relevant and what can be discarded.

- **Requires**: PHP 5.4+ & DOMDocument (libxml)
+ ![Screenshot](https://raw.githubusercontent.com/andreskrey/readability.php/assets/screenshot.png)

- **Lead Developer**: Andres Rey
+ The project aim is to be a 1 to 1 port of Mozilla's version and to follow closely all changes introduced there, but there are some major differences on the structure. Most of the code is a 1:1 copy –even the comments were imported– but some functions and structures were adapted to suit better the PHP language.
+
+ ## Requirements

- ## Status
+ PHP 5.6+, ext-dom, ext-xml, and ext-mbstring. To install all these dependencies (in the rare case your system does not have them already), you could try something like this in *nix like environments:

- Current status is stable. Version 1.0 is around the corner.
+ `$ sudo apt-get install php7.1-xml php7.1-mbstring`
+
+ **Lead Developer**: Andres Rey

  ## How to use it

  First you have to require the library using composer:

  `composer require andreskrey/readability.php`

- Then, create and HTMLParser object with your preferences, feed the `parse()` function with your HTML and check the resulting array:
+ Then, create a Readability class and pass a Configuration class, feed the `parse()` function with your HTML and echo the variable:

  ```php
  use andreskrey\Readability\HTMLParser;
+ use andreskrey\Readability\Configuration;

- $readability = new HTMLParser();
+ $readability = new Readability(new Configuration());

  $html = file_get_contents('http://your.favorite.newspaper/article.html');

- $result = $readability->parse($html);
+ try {
+     $readability->parse($html);
+     echo $readability;
+ } catch (ParseException $e) {
+     echo sprintf('Error processing text: %s', $e->getMessage());
+ }
  ```

- The `$result` variable now will hold the following information:
+ Your script will output the parsed text or inform about any errors. You should always wrap the `->parse` call in a try/catch block because if the HTML cannot be parsed correctly, a `ParseException` will be thrown.
+
+ If you want to have a finer control on the output, just call the properties one by one, wrapping it with your own HTML.
+
+ ```php
+ <h1><?= $readability->getTitle(); ?></h1>
+ <h2>By <?= $readability->getAuthor(); ?></h2>
+ <div class="content"><?= $readability->getContent(); ?></div>

- ```
- $result = [
-     'title' => 'Title of the article',
-     'author' => 'Name of the author of the article',
-     'image' => 'Main image of the article',
-     'images' => 'All images of the article',
-     'article' => 'DOMDocument with the full article text, scored and parsed'
- ]
  ```

- If the parsing process was unsuccessful the HTMLParser will return `false`
+ Here's a list of the available properties:
+
+ - Article title: `->getTitle();`
+ - Article content: `->getContent();`
+ - Excerpt: `->getExcerpt();`
+ - Main image: `->getImage();`
+ - All images: `->getImages();`
+ - Author: `->getAuthor();`
+ - Text direction (ltr or rtl): `->getDirection();`

  ## Options

- - **maxTopCandidates**: default value `5`, max amount of top level candidates.
- - **wordThreshold**: default value `500`, minimum amount of characters to consider that the article was parsed successful.
- - **articleByLine**: default value `false`, search for the article byline and remove it from the text. It will be moved to the article metadata.
- - **stripUnlikelyCandidates**: default value `true`, remove nodes that are unlikely to have relevant information. Useful for debugging or parsing complex or non-standard articles.
- - **cleanConditionally**: default value `true`, remove certain nodes after parsing to return a cleaner result.
- - **weightClasses**: default value `true`, weight classes during the rating phase.
- - **removeReadabilityTags**: default value `true`, remove the data-readability tags inside the nodes that are added during the rating phase.
- - **fixRelativeURLs**: default value `false`, convert relative URLs to absolute. Like `/test` to `http://host/test`.
- - **substituteEntities**: default value `false`, disables the `substituteEntities` flag of libxml. Will avoid substituting HTML entities. Like `&aacute;` to á.
- - **normalizeEntities**: default value `false`, converts UTF-8 characters to its HTML Entity equivalent. Useful to parse HTML with mixed encoding.
- - **originalURL**: default value `http://fakehost`, original URL from the article used to fix relative URLs.
- - **summonCthulhu**: default value `false`, remove all `<script>` nodes via regex. This is not ideal as it might break things, but might be the only solution to [libxml problems with unescaped javascript](https://github.com/andreskrey/readability.php#known-issues).
+ You can change the behaviour of Readability via the Configuration object. For example, if you want to fix relative URLs and declare the original URL, you could set up the configuration like this:
+
+ ```php
+ $configuration = new Configuration();
+ $configuration->setFixRelativeURLs(true)
+     ->setOriginalURL('http://my.newspaper.url/article/something-interesting-to-read.html');
+ ```
+
+ Then you pass this Configuration object to Readability. The following options are available. Remember to prepend `set` when calling them.
+
+ - **MaxTopCandidates**: default value `5`, max amount of top level candidates.
+ - **WordThreshold**: default value `500`, minimum amount of characters to consider that the article was parsed successful.
+ - **ArticleByLine**: default value `false`, search for the article byline and remove it from the text. It will be moved to the article metadata.
+ - **StripUnlikelyCandidates**: default value `true`, remove nodes that are unlikely to have relevant information. Useful for debugging or parsing complex or non-standard articles.
+ - **CleanConditionally**: default value `true`, remove certain nodes after parsing to return a cleaner result.
+ - **WeightClasses**: default value `true`, weight classes during the rating phase.
+ - **RemoveReadabilityTags**: default value `true`, remove the data-readability tags inside the nodes that are added during the rating phase.
+ - **FixRelativeURLs**: default value `false`, convert relative URLs to absolute. Like `/test` to `http://host/test`.
+ - **SubstituteEntities**: default value `false`, disables the `substituteEntities` flag of libxml. Will avoid substituting HTML entities. Like `&aacute;` to á.
+ - **NormalizeEntities**: default value `false`, converts UTF-8 characters to its HTML Entity equivalent. Useful to parse HTML with mixed encoding.
+ - **OriginalURL**: default value `http://fakehost`, original URL from the article used to fix relative URLs.
+ - **SummonCthulhu**: default value `false`, remove all `<script>` nodes via regex. This is not ideal as it might break things, but might be the only solution to [libxml problems with unescaped javascript](https://github.com/andreskrey/readability.php#known-issues). If you're not parsing Javascript tutorials, it's recommended to always set this option as `true`.

  ## Limitations

- Of course the main limitation is PHP. Websites that load the content through lazy loading, AJAX, or any type of javascript fueled call will be ignored (actually, *not ran*) and the resulting text will be incorrect, compared to the readability.js results. All the articles you want to parse with readability.php will need to be complete and all the content should be in the HTML already.
+ Of course the main limitation is PHP. Websites that load the content through lazy loading, AJAX, or any type of javascript fueled call will be ignored (actually, *not ran*) and the resulting text will be incorrect, compared to the readability.js results. All the articles you want to parse with readability.php need to be complete and all the content should be in the HTML already.

  ## Known Issues

@@ -76,7 +103,7 @@ DOMDocument has some issues while parsing javascript with unescaped HTML on stri
  </script>
  ```

- If you would like to remove the scripts of the HTML (like readability does), you would expect ending up with just one div and one comment on the final HTML. The problem is that libxml takes that closing div tag inside the javascript string as a HTML tag, effectively closing the unclosed tag and leaving the rest of the javascript as a string withing a P tag. If you save that node, the final HTML will end up like this:
+ If you would like to remove the scripts of the HTML (like readability does), you would expect ending up with just one div and one comment on the final HTML. The problem is that libxml takes that closing div tag inside the javascript string as a HTML tag, effectively closing the unclosed tag and leaving the rest of the javascript as a string within a P tag. If you save that node, the final HTML will end up like this:

  ```html
  <div> <!-- Offending div without closing tag -->
@@ -99,12 +126,12 @@ Self closing tags like `<br />` get automatically expanded to `<br></br`. No way

  ## Dependencies

- Readability uses the Element interface and class from *The PHP League's* **[html-to-markdown](https://github.com/thephpleague/html-to-markdown/)**. The Readability object is an extension of the Element class. It overrides some methods but relies on it for basic DOMElement parsing.
+ Readability.php has no dependencies to other libraries.

  ## To-do

- - Right now the Readability object is an extension of the Element object of html-to-markdown. This is a problem because you lose context. The scoring when creating a new Readability object must be reloaded manually. The DOMDocument object is consistent across the same document. You change one value here and that will update all other nodes in other variables. By using the element interface you lose that reference and the score must be restored manually. Ideally, the Readability object should be an extension of the DOMDocument or DOMElement objects, the score should be saved within that object and no restoration or recalculation would be needed.
- - There are a lot of problems with responsabilities. Right now there are two classes: HTMLParser and Readability. HTMLParser does a lot of things that should be a responsibility of Readability. It also does a lot of things that should be part of another class, specially when building the final article DOMDocument.
+ - Keep up with Readability.js changes
+ - Add a small template engine for the __toString() method, instead of using a hardcoded one.

  ## How it works

@@ -114,12 +141,10 @@ Readability parses all the text with DOMDocument, scans the text nodes and gives

  Up to date with readability.js as of [16 Oct 2017](https://github.com/mozilla/readability/commit/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5).

- ## TO-DOs of the current port:
-
- - Port `_cleanStyles` to avoid style attributes inside other tags (like `<p style="hello ">`)
-
  ## License

+ Based on Arc90's readability.js (1.7.1) script available at: http://code.google.com/p/arc90labs-readability
+
  Copyright (c) 2010 Arc90 Inc

  Licensed under the Apache License, Version 2.0 (the "License");
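
Note on the `SummonCthulhu` option in the README diff above: it is described as removing all `<script>` nodes via regex, as a workaround for the libxml problem with closing tags inside JavaScript strings. The sketch below shows that general idea in plain PHP. The `stripScripts()` helper and its pattern are assumptions made for illustration, not the library's actual implementation.

```php
<?php

// Illustrative helper: the function name and the regex are assumptions,
// not copied from readability.php.
function stripScripts($html)
{
    // Drop every <script>...</script> pair before libxml sees the markup.
    // "s" lets the dot match newlines, "i" ignores tag case.
    return preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);
}

// The string literal inside the script holds a closing div tag, which is
// the kind of input the Known Issues section warns about.
$html = '<div><script>document.write("</div>");</script><p>Article text</p></div>';

$clean = stripScripts($html);
echo $clean, PHP_EOL;

// With the script gone, DOMDocument only has ordinary markup to parse.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($clean);
echo $dom->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

As the option description itself warns, a regex pass like this is blunt: it would also delete script samples that are part of an article's content, such as code in a JavaScript tutorial.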

composer.json

Lines changed: 3 additions & 3 deletions
@@ -18,12 +18,12 @@
          }
      },
      "require": {
-         "php": ">=5.4.0",
+         "php": ">=5.6.0",
          "ext-dom": "*",
          "ext-xml": "*",
-         "league/html-to-markdown": "^4.2"
+         "ext-mbstring": "*"
      },
      "require-dev": {
-         "phpunit/phpunit": "4.*"
+         "phpunit/phpunit": "^5.7"
      }
  }

phpunit.xml

Lines changed: 6 additions & 1 deletion
@@ -8,4 +8,9 @@
              <directory>./test/</directory>
          </testsuite>
      </testsuites>
- </phpunit>
+     <filter>
+         <whitelist>
+             <directory suffix=".php">src/</directory>
+         </whitelist>
+     </filter>
+ </phpunit>