Open
Enhancement
Description
Problem Statement
The current data preprocessing scripts (preprocess_docs.py, preprocess_plugin_docs.py, filter_processed_docs.py) contain implicit logic and intermediate variables that make the flow difficult to trace.
Additionally, filter_processed_docs.py uses generic log messages (e.g., "Filtering the url..."), making it hard to debug why specific documents are dropped. It also relies on magic numbers for heuristics.
Proposed Changes
Refactor the preprocessing scripts to improve code readability, debugging transparency, and maintainability.
preprocess_docs.py & preprocess_plugin_docs.py:
- Implement a sequential "pipeline" pattern for cleaning content.
- Remove duplicated cleaning logic.
- Add explicit comments for each cleaning step.
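A minimal sketch of what the sequential pipeline could look like. The step functions (`strip_html_tags`, `collapse_whitespace`) and the `clean_content` entry point are hypothetical names chosen for illustration; the actual cleaning steps in the scripts may differ.

```python
import re
from typing import Callable, Iterable, Tuple

# A cleaning step takes text and returns cleaned text.
CleaningStep = Callable[[str], str]


def strip_html_tags(text: str) -> str:
    # Step 1 (hypothetical): replace anything that looks like an HTML tag with a space.
    return re.sub(r"<[^>]+>", " ", text)


def collapse_whitespace(text: str) -> str:
    # Step 2 (hypothetical): squeeze whitespace runs into one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


# Steps run in this order; tag removal leaves extra spaces that the
# whitespace step then cleans up.
CLEANING_PIPELINE: Tuple[CleaningStep, ...] = (
    strip_html_tags,
    collapse_whitespace,
)


def clean_content(text: str, steps: Iterable[CleaningStep] = CLEANING_PIPELINE) -> str:
    # Apply each step in sequence, feeding the output of one into the next.
    for step in steps:
        text = step(text)
    return text
```

Because both scripts would import the same `clean_content`, the duplicated cleaning logic collapses into one shared list of steps, and adding or reordering a step is a one-line change.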
filter_processed_docs.py:
- Update logging to state the specific reason a document is dropped (e.g., "text length < 300").
- Replace magic numbers with named constants (e.g., AVG_CHARS_PER_WORD_HEURISTIC).
- Optimize the filtering loop to run in a single pass.
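The filtering changes could be sketched roughly as follows. Only the 300-character threshold and the constant name `AVG_CHARS_PER_WORD_HEURISTIC` come from this issue; the constant's value, `MIN_WORD_RATIO`, the way the heuristic is applied, the document shape (a dict with `url` and `text` keys), and the function names are all assumptions for illustration.

```python
import logging
from typing import Dict, Iterable, List, Optional

logger = logging.getLogger(__name__)

MIN_TEXT_LENGTH = 300             # threshold taken from the example in this issue
AVG_CHARS_PER_WORD_HEURISTIC = 5  # assumed value, including the trailing space
MIN_WORD_RATIO = 0.5              # assumed: tolerate half the expected word count


def drop_reason(doc: Dict[str, str]) -> Optional[str]:
    """Return a specific reason to drop the document, or None to keep it."""
    text = doc.get("text", "")
    if len(text) < MIN_TEXT_LENGTH:
        return f"text length {len(text)} < {MIN_TEXT_LENGTH}"
    # Assumed heuristic: text with far fewer words than its length suggests
    # is probably not prose (for example, one long unbroken token).
    expected_words = len(text) / AVG_CHARS_PER_WORD_HEURISTIC
    actual_words = len(text.split())
    if actual_words < expected_words * MIN_WORD_RATIO:
        return f"word count {actual_words} < {MIN_WORD_RATIO} * expected {expected_words:.0f}"
    return None


def filter_docs(docs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    # Single pass: each document is checked once and either kept or logged.
    kept = []
    for doc in docs:
        reason = drop_reason(doc)
        if reason is None:
            kept.append(doc)
        else:
            logger.info("Dropping %s: %s", doc.get("url", "<no url>"), reason)
    return kept
```

Keeping the checks in `drop_reason` means the log message and the filtering decision come from the same place, so they cannot drift apart.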