This repository was archived by the owner on Aug 14, 2021. It is now read-only.

Commit 44a3db9

Merge pull request #13 from andreskrey/development
Merging to master for next release
2 parents 9d4ce47 + dd353da commit 44a3db9

141 files changed: +45732 -62 lines changed


.travis.yml

Lines changed: 2 additions & 1 deletion
@@ -1,7 +1,8 @@
 language: php

+install: composer install
+
 php:
-  - "5.3"
   - "5.4"
   - "5.5"
   - "5.6"

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -3,6 +3,14 @@ All notable changes to this project will be documented in this file.

 ## Unreleased

+## [0.0.3-alpha](https://github.com/andreskrey/readability.php/releases/tag/v0.0.3v-alpha)
+
+We are getting closer to being a 100% complete port of Readability.js!
+- Added prepArticle to remove junk after selecting the top candidates.
+- Added a function to restore the score after selecting top candidates. This works by scanning the data-readability tag and restoring the score to the contentScore variable. This is a horrible hack and should be removed once we ditch the Element interface of html-to-markdown and start extending the DOMDocument object.
+- Switched all strlen functions to mb_strlen
+- Fixed lots of bugs, and I'm pretty sure I introduced a bunch of new ones.
+
 ## [0.0.2-alpha](https://github.com/andreskrey/readability.php/releases/tag/v0.0.2-alpha)
 - Last version in which I'm using master as the main development branch. All unreleased changes and main development will happen in the develop branch.
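
Note on the `strlen` to `mb_strlen` switch listed in the changelog entry above: `strlen` counts bytes, while `mb_strlen` counts characters, so the two disagree on any non-ASCII text. The following is a minimal sketch, assuming the `mbstring` extension is loaded and the input is UTF-8. The sample string is made up for illustration and does not come from the project.

```php
<?php
// "café" with "é" spelled out as its UTF-8 byte sequence (0xC3 0xA9),
// so the example does not depend on the encoding of this source file.
$word = "caf\xC3\xA9";

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts characters

echo $byteCount, ' bytes, ', $charCount, " characters\n";
```

Any length-based heuristic applied to article text would see inflated values for accented or non-Latin content if it counted bytes rather than characters.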

README.md

Lines changed: 14 additions & 2 deletions
@@ -1,9 +1,9 @@
 # Readability.php
-[![Latest Stable Version](https://poser.pugx.org/andreskrey/readability.php/v/stable)](https://packagist.org/packages/andreskrey/readability.php) [![StyleCI](https://styleci.io/repos/71042668/shield?branch=master)](https://styleci.io/repos/71042668)
+[![Latest Stable Version](https://poser.pugx.org/andreskrey/readability.php/v/stable)](https://packagist.org/packages/andreskrey/readability.php) [![StyleCI](https://styleci.io/repos/71042668/shield?branch=master)](https://styleci.io/repos/71042668) [![Build Status](https://travis-ci.org/andreskrey/readability.php.svg?branch=master)](https://travis-ci.org/andreskrey/readability.php)

 PHP port of *Mozilla's* **[Readability.js](https://github.com/mozilla/readability)**. Parses HTML text (usually news and other articles) and tries to return title, byline and text content. Analyzes each text node, gives it a score and orders the nodes based on this calculation.

-**Requires**: PHP 5.4+
+**Requires**: PHP 5.4+ & DOMDocument (libxml)

 **Lead Developer**: Andres Rey

@@ -40,8 +40,19 @@ $result = [
 ]
 ```

+If the parsing process was unsuccessful, the HTMLParser will return `false`.
+
 ## Options

+- **maxTopCandidates**: default value `5`, max amount of top level candidates.
+- **articleByLine**: default value `false`, search for the article byline.
+- **stripUnlikelyCandidates**: default value `true`, remove nodes that are unlikely to have relevant information. Disabling it can help when debugging or when parsing complex or non-standard articles.
+- **cleanConditionally**: default value `true`, remove certain nodes after parsing to return a cleaner result.
+- **weightClasses**: default value `true`, weight classes during the rating phase.
+- **removeReadabilityTags**: default value `true`, remove the data-readability tags inside the nodes that are added during the rating phase.
+- **fixRelativeURLs**: default value `false`, convert relative URLs to absolute. Like `/test` to `http://host/test`.
+- **originalURL**: default value `http://fakehost`, original URL from the article used to fix relative URLs.
+
 ## Limitations

 Of course the main limitation is PHP. Websites that load the content through lazy loading, AJAX, or any type of JavaScript-driven call will be ignored (actually, *not run*) and the resulting text will be incorrect, compared to the readability.js results. All the articles you want to parse with readability.php will need to be complete and all the content should be in the HTML already.
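
Note on the `fixRelativeURLs` and `originalURL` options added in the hunk above: the sketch below shows the kind of conversion they describe (`/test` to `http://host/test`). It is not the library's implementation. `toAbsoluteUrl` is a hypothetical helper written only for illustration, and it deliberately skips details such as resolving `../` segments.

```php
<?php
/**
 * Hypothetical helper (not part of readability.php): resolve $href
 * against $originalUrl in the spirit of the fixRelativeURLs option.
 */
function toAbsoluteUrl($href, $originalUrl)
{
    // Anything that already carries a scheme (http:, mailto:, ...) is left untouched.
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;
    }

    $base = parse_url($originalUrl);
    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative URL: reuse the scheme of the original URL.
    if (substr($href, 0, 2) === '//') {
        return $base['scheme'] . ':' . $href;
    }

    // Root-relative URL: append to the origin.
    if (substr($href, 0, 1) === '/') {
        return $origin . $href;
    }

    // Path-relative URL: resolve against the directory of the original path.
    $path = isset($base['path']) ? $base['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $href;
}

echo toAbsoluteUrl('/test', 'http://host'), "\n";
echo toAbsoluteUrl('img/a.png', 'http://host/news/article.html'), "\n";
```

With the default `originalURL` of `http://fakehost`, links resolved this way would point at a placeholder host, so the real article URL has to be supplied for the result to be useful.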
@@ -78,6 +89,7 @@ Readability uses the Element interface and class from *The PHP League's* **[html
 100% of the original readability code was ported, at least until the last commit when I started this project ([13 Aug 2016](https://github.com/mozilla/readability/commit/71aa562387fa507b0bac30ae7144e1df7ba8a356)). There are a lot of `TODO`s around the code, which mark the parts that still need to be finished.

 - Right now the Readability object is an extension of the Element object of html-to-markdown. This is a problem because you lose context. The scoring when creating a new Readability object must be reloaded manually. The DOMDocument object is consistent across the same document. You change one value here and that will update all other nodes in other variables. By using the element interface you lose that reference and the score must be restored manually. Ideally, the Readability object should be an extension of the DOMDocument or DOMElement objects, the score should be saved within that object and no restoration or recalculation would be needed.
+- There are a lot of problems with responsibilities. Right now there are two classes: HTMLParser and Readability. HTMLParser does a lot of things that should be a responsibility of Readability. It also does a lot of things that should be part of another class, especially when building the final article DOMDocument.

 ## How it works
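
Note on the score restoration described in the hunk above: the README says a DOMDocument stays consistent across variables, and that the score currently has to be restored from the `data-readability` attribute. The sketch below illustrates both points with plain DOM calls. It is an assumption-laden illustration, not the project's code: `restoreScore` is a hypothetical function, and the markup and score value are invented.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><p>Some article text</p></div></body></html>');

// First handle on the <div>, obtained through XPath.
$xpath = new DOMXPath($doc);
$div = $xpath->query('//div')->item(0);

// "Rating phase": persist an invented score on the node itself.
$div->setAttribute('data-readability', '12.5');

// Second handle on the same node, reached from its child paragraph.
$sameDiv = $doc->getElementsByTagName('p')->item(0)->parentNode;

/**
 * Hypothetical helper: read the score back from the attribute,
 * falling back to 0.0 when the node was never rated.
 */
function restoreScore(DOMElement $node)
{
    return $node->hasAttribute('data-readability')
        ? (float) $node->getAttribute('data-readability')
        : 0.0;
}

$contentScore = restoreScore($sameDiv);
echo $contentScore, "\n";
```

Because both handles refer to the same underlying node, the attribute written through `$div` is visible through `$sameDiv`. That shared state is what the README proposes to rely on once the class extends DOMDocument or DOMElement.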

phpunit.xml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<phpunit bootstrap="vendor/autoload.php"
+         colors="true"
+         stopOnFailure="false"
+         stopOnError="false">
+    <testsuites>
+        <testsuite name="Readability.php Test Suite">
+            <directory>./test/</directory>
+        </testsuite>
+    </testsuites>
+</phpunit>
