[Refactor] Improve maintainability and observability of data preprocessing pipelines #145

@vatsaljain568

Description

Problem Statement

The current data preprocessing scripts (preprocess_docs.py, preprocess_plugin_docs.py, filter_processed_docs.py) contain implicit logic and intermediate variables that make the data flow difficult to trace.
Additionally, filter_processed_docs.py emits generic log messages (e.g., "Filtering the url..."), which makes it hard to determine why a specific document was dropped, and it relies on unexplained magic numbers for its filtering heuristics.

Proposed Changes

Refactor the preprocessing scripts to improve code readability, debugging transparency, and maintainability.

  1. preprocess_docs.py & preprocess_plugin_docs.py:

    • Implement a sequential "pipeline" pattern for cleaning content.
    • Remove duplicated cleaning logic.
    • Add explicit comments for each cleaning step.
  2. filter_processed_docs.py:

    • Update logging to state the specific reason a document is dropped (e.g., "text length < 300").
    • Replace magic numbers with named constants (e.g., AVG_CHARS_PER_WORD_HEURISTIC).
    • Optimize the filtering loop to run in a single pass.
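The sequential "pipeline" pattern from item 1 could be sketched roughly as follows. The step names (`drop_empty_lines`, `normalize_whitespace`) and the shared `CLEANING_PIPELINE` list are illustrative, not taken from the repo; the point is that both scripts iterate the same explicit, ordered list of cleaning steps instead of duplicating inline logic:

```python
from typing import Callable, List

# Each cleaning step is a pure str -> str function applied in order.
CleaningStep = Callable[[str], str]

def drop_empty_lines(text: str) -> str:
    # Remove lines that contain only whitespace.
    return "\n".join(line for line in text.splitlines() if line.strip())

def normalize_whitespace(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

# One explicit, ordered pipeline shared by preprocess_docs.py and
# preprocess_plugin_docs.py, so the cleaning logic is defined once.
CLEANING_PIPELINE: List[CleaningStep] = [
    drop_empty_lines,
    normalize_whitespace,
]

def clean_content(text: str) -> str:
    for step in CLEANING_PIPELINE:
        text = step(text)
    return text
```

Adding or reordering a cleaning step then becomes a one-line change to `CLEANING_PIPELINE`, and each step can be commented and unit-tested in isolation.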
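A minimal sketch of the single-pass filter from item 2, with named constants and per-document drop reasons in the logs. The constant values, the `drop_reason`/`filter_docs` helper names, and the document-dict shape (`"url"`, `"text"` keys) are assumptions for illustration, not the repo's actual code:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# Named constants replacing magic numbers (values are illustrative).
MIN_TEXT_LENGTH = 300
AVG_CHARS_PER_WORD_HEURISTIC = 5  # rough chars-per-word estimate
MIN_ESTIMATED_WORDS = 40

def drop_reason(doc: dict) -> Optional[str]:
    """Return the specific reason a document should be dropped, or None to keep it."""
    text = doc.get("text", "")
    if len(text) < MIN_TEXT_LENGTH:
        return f"text length < {MIN_TEXT_LENGTH}"
    if len(text) / AVG_CHARS_PER_WORD_HEURISTIC < MIN_ESTIMATED_WORDS:
        return f"estimated word count < {MIN_ESTIMATED_WORDS}"
    return None

def filter_docs(docs: list) -> list:
    # Single pass: every document is inspected exactly once, and each
    # dropped document is logged with the concrete reason.
    kept = []
    for doc in docs:
        reason = drop_reason(doc)
        if reason is not None:
            logger.info("Dropping %s: %s", doc.get("url", "<no url>"), reason)
        else:
            kept.append(doc)
    return kept
```

Keeping the reason string in one place (`drop_reason`) means the log line and the filtering decision can never drift apart.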
