Adding more tools to the benchmark? #3

@adbar

Hi,

Thanks for your contribution. It's really useful to see evaluations on real-world data! There are further extraction tools for Python that this repository doesn't feature yet and that could be more efficient than some of the ones you mention. You might have a look at:

  • goose3
  • jusText (especially with a custom configuration)
  • inscriptis (html-to-txt conversion)
  • trafilatura (disclaimer: I'm the author).

Or is there a reason why you didn't use them in the first place? I'd be curious to hear about it.

For more details, please refer to the evaluation I've performed. The code, including baselines, is available here.
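To make the suggestion concrete, here is a minimal sketch of how the tools listed above could be wired into a comparison harness next to a simple baseline. It is not the code from this benchmark or from the evaluation mentioned above. The helper names (`baseline_extract`, `available_extractors`, `run_all`) are hypothetical, and the library calls are the basic entry points as I understand them from each project's documentation, so please check them against the installed versions. jusText is shown with its default settings rather than a custom configuration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Stdlib-only baseline: keep text nodes outside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def baseline_extract(html):
    """Hypothetical baseline: all visible text, one chunk per line."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def available_extractors():
    """Map tool names to callables; tools that are not installed are skipped."""
    extractors = {"baseline": baseline_extract}

    try:
        import trafilatura
        # extract() returns None when nothing is found, so normalise to "".
        extractors["trafilatura"] = lambda html: trafilatura.extract(html) or ""
    except ImportError:
        pass

    try:
        import justext

        def _justext(html):
            # Default settings with the English stoplist (an assumption here).
            paragraphs = justext.justext(html, justext.get_stoplist("English"))
            return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

        extractors["justext"] = _justext
    except ImportError:
        pass

    try:
        from inscriptis import get_text
        extractors["inscriptis"] = get_text
    except ImportError:
        pass

    try:
        from goose3 import Goose
        extractors["goose3"] = lambda html: Goose().extract(raw_html=html).cleaned_text
    except ImportError:
        pass

    return extractors


def run_all(html):
    """Run every available extractor; record failures instead of raising."""
    results = {}
    for name, extract in available_extractors().items():
        try:
            results[name] = extract(html)
        except Exception as exc:
            results[name] = f"<failed: {exc}>"
    return results
```

Calling `run_all(html)` on a page returns a dict keyed by tool name, which can then be scored against a gold-standard text with whatever metric the benchmark already uses. Guarding each import keeps the harness usable when only some of the tools are installed.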
