Inquiry regarding intra-document quality filtering in Qwen pre-training #1812

XevWright · 2026-02-17T08:01:07Z

Hi, thank you for the incredible contribution!

I am specifically interested in how your pipeline handles intra-document noise in very long documents during LLM pre-training. For example, suppose a very long scraped document is 90% high-quality text but has 10% auto-generated boilerplate or SEO spam in the middle:

Does your pipeline remove or mask the specific flagged chunks (intra-document removal) and stitch the remaining clean text back together?

Or is the approach still strictly document-level dropping, where the entire document is discarded if the ratio of flagged spans exceeds a threshold?
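
To make the two options concrete, here is a minimal Python sketch of what I mean. It is purely illustrative and not a claim about how the Qwen pipeline works: the `Span` type, the function names, and the `max_ratio` threshold are all hypothetical, and the flagged spans are assumed to be character offsets produced by some upstream quality classifier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """A flagged region of a document (hypothetical type for illustration)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort flagged spans and merge any that overlap or touch."""
    merged: List[Span] = []
    for s in sorted(spans, key=lambda sp: sp.start):
        if merged and s.start <= merged[-1].end:
            merged[-1] = Span(merged[-1].start, max(merged[-1].end, s.end))
        else:
            merged.append(Span(s.start, s.end))
    return merged


def flagged_ratio(text: str, spans: List[Span]) -> float:
    """Fraction of the document's characters covered by flagged spans."""
    if not text:
        return 0.0
    return sum(s.end - s.start for s in merge_spans(spans)) / len(text)


def remove_spans(text: str, spans: List[Span]) -> str:
    """Option 1 (intra-document removal): cut flagged spans, stitch the rest."""
    kept = []
    cursor = 0
    for s in merge_spans(spans):
        kept.append(text[cursor:s.start])
        cursor = s.end
    kept.append(text[cursor:])
    return "".join(kept)


def drop_if_noisy(text: str, spans: List[Span], max_ratio: float) -> Optional[str]:
    """Option 2 (document-level dropping): keep the whole document or discard it."""
    return None if flagged_ratio(text, spans) > max_ratio else text


# Toy document with a flagged chunk in the middle.
doc = "AAAA" + "xx" + "BBBB"
flags = [Span(4, 6)]
cleaned = remove_spans(doc, flags)                 # flagged chunk cut out
kept_whole = drop_if_noisy(doc, flags, max_ratio=0.5)  # kept unchanged
dropped = drop_if_noisy(doc, flags, max_ratio=0.1)     # discarded (None)
```

With option 1 the text on either side of the cut becomes adjacent, which is part of why I am asking whether the stitched result is used as-is.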

Thanks!

Replies: 0 comments