This repository was archived by the owner on Aug 14, 2021. It is now read-only.

HTML infiltrating in imported content #80

@sfatfarma


Hello,

First of all, congrats on the code!

For some pages, HTML markup leaks into the extracted text content. For example, with this article:

https://www.yahoo.com/entertainment/chris-evans-ignites-celeb-civil-085545038.html

Snippet of the extracted content:

Chris Evans causing people to choose sides in a way that hasn’t been seen since “Captain America: Civil War.”” data-reactid=”16″ type=”text”>There’s an issue that’s divided Twitter almost as much as anything in politics this week, with Chris Evans causing people to choose sides in a way that hasn’t been seen since “Captain America: Civil War.”

As you can see, this fragment is leaking into the content:

”” data-reactid=”16″ type=”text”>

This is my current code:

    $readConf = new Configuration();
    $readConf->setSummonCthulhu(true);
    $readability = new Readability($readConf);
    $readability->parse($html_string);
    $return_me = $readability->getContent();

Any help is appreciated.

Regards,
Szabi.
