Open
Description
Some datasets (e.g. bigscience-catalogue-lm-data/lm_es_pseudocrawl-filtered_396_www_eldiario_es) happen to contain a mix of:
- crawling noise
- ad javascript
that results in code-like snippets, which can be caught by looking for `{` and `}` characters. I have also seen a fair number of stray `{` typos that are unrelated, for some reason. I don't think we want to remove them, but I don't think we really care either. I'd advocate either removing any line in the pseudocrawled newspaper datasets that contains `}`, or removing all of the `{...}` groups.
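To make the two options concrete, here is a rough sketch of what each could look like. The function names and the regex are my own illustration, not existing code in the cleaning pipeline, and it assumes each document is available as a plain string with `\n` line breaks.

```python
import re

# Matches an innermost {...} group, i.e. one with no braces inside it.
# The negated class also matches newlines, so groups may span several lines.
BRACE_GROUP = re.compile(r"\{[^{}]*\}")


def drop_lines_with_closing_brace(text: str) -> str:
    """Option 1 (hypothetical helper): drop every line that contains '}'."""
    return "\n".join(line for line in text.split("\n") if "}" not in line)


def strip_brace_groups(text: str) -> str:
    """Option 2 (hypothetical helper): remove balanced {...} groups.

    Innermost groups are removed first and the substitution is repeated
    until nothing changes, so nested groups disappear as well.
    An unmatched '{' (the stray typo case) is left untouched.
    """
    previous = None
    while previous != text:
        previous = text
        text = BRACE_GROUP.sub("", text)
    return text


if __name__ == "__main__":
    sample = "Some article text.\nfunction(){var a={b:1};}\nA typo{ in a sentence."
    print(drop_lines_with_closing_brace(sample))
    print(strip_brace_groups(sample))
```

Option 1 is the more aggressive of the two: it also throws away any legitimate text sharing a line with the snippet. Option 2 keeps the surrounding text but can leave fragments such as `function()` behind. Both leave a lone `{` typo in place, since it never pairs with a `}`.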