Duplicate Results #2

@yosr-nabli

Description

  1. Entity Filters
    The paper identifies date/time tokens as the most common redundant entities. The correct per-dataset filters are:

HDFS — no filter needed. Block IDs (blk_XXXXXX) and IPs are genuine entities and should pass through unchanged.

BGL / TDB — filter the compound timestamp-correlation token (e.g. 1117838570-2005-06-03-15.42.50.675872) that appears in every log line and would otherwise connect all logs together as false positives.

However, even for HDFS I don't get the same number of Entity nodes as reported in the paper. Can you provide the exact filtering used for HDFS, TDB, and BGL?
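To make the question concrete, here is a minimal sketch of the filtering I have tried, assuming the token formats described above; the regex pattern and the `filter_entities` helper are my own guesses, not the paper's implementation:

```python
import re

# Assumed pattern for the BGL/TDB compound timestamp-correlation token,
# e.g. "1117838570-2005-06-03-15.42.50.675872" (epoch-date-time.microseconds).
TIMESTAMP_TOKEN_RE = re.compile(
    r"^\d{10}-\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}$"
)

def filter_entities(tokens, dataset):
    """Drop redundant date/time entity tokens for the given dataset (sketch)."""
    if dataset == "HDFS":
        # No filter: block IDs (blk_XXXXXX) and IPs are genuine entities.
        return list(tokens)
    if dataset in ("BGL", "TDB"):
        return [t for t in tokens if not TIMESTAMP_TOKEN_RE.match(t)]
    raise ValueError(f"unknown dataset: {dataset}")
```

With this filter, `filter_entities(["1117838570-2005-06-03-15.42.50.675872", "node-42"], "BGL")` keeps only `"node-42"`, while HDFS tokens pass through unchanged. Is this the intended behaviour, or are additional token classes filtered?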

  2. Non-Grouped Sampling — Missing Implementation
    This is never implemented in the code. The dataset loaders always attach a structured group ID to each log — block_id for HDFS, node_id for BGL, user_id/component_id for TDB — meaning every sample is inherently grouped by one of those fields.
    Can you provide the implementation for this as well?
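For reference, this is roughly what I would expect "non-grouped" sampling to look like: fixed-size sliding windows over the raw log order, ignoring block_id/node_id/user_id entirely. The window size and stride here are placeholders, not values from the paper:

```python
def non_grouped_samples(logs, window=20, stride=20):
    """Yield fixed-size windows of consecutive log lines,
    ignoring any structured group field (sketch, assumed parameters)."""
    for start in range(0, max(len(logs) - window + 1, 1), stride):
        yield logs[start:start + window]
```

Is this the intended semantics, or does non-grouped sampling mean something else (e.g. random sampling of individual lines)?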
