Catching crawling noise + ads #9

@TevenLeScao

Some datasets (e.g. bigscience-catalogue-lm-data/lm_es_pseudocrawl-filtered_396_www_eldiario_es) happen to contain a mix of:

  • crawling noise
  • ad JavaScript

The result is code-like snippets that can be caught by looking for { and } characters. I have also seen a number of stray { typos that are unrelated for some reason; I don't think we want to remove them, but I don't think we really care either. I'd advocate either removing every line that contains } in the pseudocrawled newspaper datasets, or removing all of the {...} groups.
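
To make the two options concrete, here is a minimal sketch of both filters on plain text. The function names and the sample string are hypothetical, not existing code from the project, and the group-stripping variant assumes that nested braces should be removed from the innermost group outwards.

```python
import re

# Matches an innermost {...} group, i.e. one with no braces inside it.
_BRACE_GROUP = re.compile(r"\{[^{}]*\}")


def drop_lines_with_closing_brace(text: str) -> str:
    """Option 1: remove every line that contains a '}' character."""
    return "\n".join(line for line in text.split("\n") if "}" not in line)


def strip_brace_groups(text: str) -> str:
    """Option 2: remove all balanced {...} groups, keeping the rest of the line.

    Innermost groups are removed repeatedly until nothing changes, so nested
    groups disappear as well. An unmatched '{' (such as a typo) is left alone.
    """
    previous = None
    while previous != text:
        previous = text
        text = _BRACE_GROUP.sub("", text)
    return text


# Hypothetical sample: one clean line, one ad snippet, one line with a stray '{'.
sample = "Noticia real.\nvar ads = {slot: {id: 1}};\nUn typo { suelto."

cleaned_by_line = drop_lines_with_closing_brace(sample)
cleaned_by_group = strip_brace_groups(sample)
```

On this sample, the first option drops the whole ad line, while the second keeps its leftover fragment (`var ads = ;`). That suggests line removal is the cleaner choice when the noise sits on its own lines. Both options leave the stray { untouched.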
