Wiki-based datasets other than Wikipedia have at least the following issues:
- template text being repeated
- non-article pages (users, categories)
- some Wikipedia formatting noise (possibly only in the non-article pages)
Some solutions already mentioned to deal with those:
- Looking at the `type` field in `meta`: keeping only `text` types seems very strong
- Deduplicating to remove templates
- Looking at the `title` field in `meta`: this allows removing user pages, for example
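
To make the three ideas above concrete, here is a minimal sketch, not the project's actual pipeline. It assumes each record is a dict with a `text` string and a `meta` dict holding `type` and `title`; the namespace prefix list, the helper names, and the `max_docs` threshold are all hypothetical and would need tuning against the real data.

```python
from collections import Counter

# Assumed namespace prefixes for non-article pages (common MediaWiki names);
# the real datasets may use others.
NON_ARTICLE_PREFIXES = ("User:", "User talk:", "Category:", "Template:", "Talk:", "File:", "Help:")


def is_article(record):
    """Keep records typed as "text" whose title carries no namespace prefix."""
    meta = record.get("meta", {})
    if meta.get("type") != "text":
        return False
    return not meta.get("title", "").startswith(NON_ARTICLE_PREFIXES)


def drop_repeated_lines(records, max_docs=2):
    """Remove lines that occur in more than `max_docs` documents (likely template text)."""
    doc_freq = Counter()
    for record in records:
        # Count each distinct non-empty line once per document.
        doc_freq.update({line.strip() for line in record["text"].splitlines() if line.strip()})
    cleaned = []
    for record in records:
        kept = [line for line in record["text"].splitlines() if doc_freq[line.strip()] <= max_docs]
        cleaned.append({**record, "text": "\n".join(kept)})
    return cleaned


def clean(records, max_docs=2):
    """Filter on meta first, then deduplicate lines across the remaining records."""
    return drop_repeated_lines([r for r in records if is_article(r)], max_docs=max_docs)
```

Filtering on `meta` before deduplicating keeps user and category pages from influencing the line counts. Exact line matching is the simplest possible deduplication; templates with small per-page variations would need a fuzzier comparison.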