[Refactor] Improve maintainability and observability of data preprocessing pipelines #145

@vatsaljain568

Description

Problem Statement

The current data preprocessing scripts (preprocess_docs.py, preprocess_plugin_docs.py, filter_processed_docs.py) contain implicit logic and intermediate variables that make the data flow difficult to trace.
Additionally, filter_processed_docs.py emits generic log messages (e.g., "Filtering the url..."), which makes it hard to determine why a specific document was dropped, and it relies on unexplained magic numbers for its filtering heuristics.

Proposed Changes

Refactor the preprocessing scripts to improve code readability, debugging transparency, and maintainability.

  1. preprocess_docs.py & preprocess_plugin_docs.py:

    • Implement a sequential "pipeline" pattern for cleaning content.
    • Remove duplicated cleaning logic.
    • Add explicit comments for each cleaning step.
  2. filter_processed_docs.py:

    • Update logging to state the specific reason a document is dropped (e.g., "text length < 300").
    • Replace magic numbers with named constants (e.g., AVG_CHARS_PER_WORD_HEURISTIC).
    • Optimize the filtering loop to run in a single pass.
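The sequential "pipeline" pattern from item 1 could be sketched roughly as follows. The step names (`drop_empty_lines`, `normalize_whitespace`) and the shared `CLEANING_PIPELINE` list are illustrative, not taken from the repo; the point is that both scripts iterate the same explicit, ordered list of cleaning steps instead of duplicating inline logic:

```python
from typing import Callable, List

# Each cleaning step is a pure str -> str function applied in order.
CleaningStep = Callable[[str], str]

def drop_empty_lines(text: str) -> str:
    # Remove lines that contain only whitespace.
    return "\n".join(line for line in text.splitlines() if line.strip())

def normalize_whitespace(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

# One explicit, ordered pipeline shared by preprocess_docs.py and
# preprocess_plugin_docs.py, so the cleaning logic is defined once.
CLEANING_PIPELINE: List[CleaningStep] = [
    drop_empty_lines,
    normalize_whitespace,
]

def clean_content(text: str) -> str:
    for step in CLEANING_PIPELINE:
        text = step(text)
    return text
```

Adding or reordering a cleaning step then becomes a one-line change to `CLEANING_PIPELINE`, and each step can be commented and unit-tested in isolation.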
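A minimal sketch of the single-pass filter from item 2, with named constants and per-document drop reasons in the logs. The constant values, the `drop_reason`/`filter_docs` helper names, and the document-dict shape (`"url"`, `"text"` keys) are assumptions for illustration, not the repo's actual code:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# Named constants replacing magic numbers (values are illustrative).
MIN_TEXT_LENGTH = 300
AVG_CHARS_PER_WORD_HEURISTIC = 5  # rough chars-per-word estimate
MIN_ESTIMATED_WORDS = 40

def drop_reason(doc: dict) -> Optional[str]:
    """Return the specific reason a document should be dropped, or None to keep it."""
    text = doc.get("text", "")
    if len(text) < MIN_TEXT_LENGTH:
        return f"text length < {MIN_TEXT_LENGTH}"
    if len(text) / AVG_CHARS_PER_WORD_HEURISTIC < MIN_ESTIMATED_WORDS:
        return f"estimated word count < {MIN_ESTIMATED_WORDS}"
    return None

def filter_docs(docs: list) -> list:
    # Single pass: every document is inspected exactly once, and each
    # dropped document is logged with the concrete reason.
    kept = []
    for doc in docs:
        reason = drop_reason(doc)
        if reason is not None:
            logger.info("Dropping %s: %s", doc.get("url", "<no url>"), reason)
        else:
            kept.append(doc)
    return kept
```

Keeping the reason string in one place (`drop_reason`) means the log line and the filtering decision can never drift apart.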
