Open
Description
- Entity Filters
The paper identifies date/time tokens as the most common redundant entities. The correct per-dataset filters are:
HDFS: no filter needed. Block IDs (blk_XXXXXX) and IPs are genuine entities and should pass through unchanged.
BGL / TDB: filter the compound timestamp-correlation token (e.g. 1117838570-2005-06-03-15.42.50.675872). It appears in every log line, so keeping it would connect all logs to each other and produce false positives.
However, for HDFS, for example, I don't get the same number of Entity nodes as reported in the paper. Could you provide the exact filtering used for HDFS, TDB, and BGL?
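To make the question concrete, here is a minimal sketch of the filtering described above. The regex, the `COMPOUND_TS` constant, and the `filter_entities` function are my own assumptions for illustration. They are not taken from the paper or the repository, and the pattern only covers tokens shaped like the example above.

```python
import re

# Assumed shape of the BGL/TDB compound token, based on the single example
# 1117838570-2005-06-03-15.42.50.675872:
# <epoch seconds>-<YYYY-MM-DD>-<HH.MM.SS.microseconds>
COMPOUND_TS = re.compile(r"\d+-\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}")


def filter_entities(entities, dataset):
    """Drop redundant timestamp tokens from a list of entity strings.

    Hypothetical helper: HDFS entities pass through unchanged, while for
    any other dataset (BGL, TDB) the compound timestamp token is removed.
    """
    if dataset.upper() == "HDFS":
        return list(entities)
    return [e for e in entities if not COMPOUND_TS.fullmatch(e)]
```

For example, `filter_entities(["1117838570-2005-06-03-15.42.50.675872", "R02-M1-N0"], "BGL")` keeps only `"R02-M1-N0"`, while HDFS block IDs and IPs are returned as-is. If the actual filters differ from this, that would explain the mismatch in Entity node counts.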
- Non-Grouped Sampling — Missing Implementation
Non-grouped sampling is never implemented in the code. The dataset loaders always attach a structured group ID to each log (block_id for HDFS, node_id for BGL, user_id/component_id for TDB), meaning every sample is inherently grouped by one of those fields.
Could you provide the implementation for this as well?
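For reference, this is roughly what I would expect a non-grouped sampler to look like. It is only a sketch under my own assumption that "non-grouped" means cutting the log stream into fixed-size windows in their original order while ignoring the group ID entirely. The function name `non_grouped_windows` and its parameters are hypothetical and may not match what the paper intends.

```python
def non_grouped_windows(logs, window_size, stride=None):
    """Split an ordered sequence of logs into fixed-size windows.

    Assumption: no group ID (block_id, node_id, user_id, ...) is consulted;
    windows are formed purely by position in the stream. With the default
    stride (equal to window_size) the windows do not overlap, and the last
    window may be shorter than window_size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if stride is None:
        stride = window_size
    if stride <= 0:
        raise ValueError("stride must be positive")
    return [logs[i:i + window_size] for i in range(0, len(logs), stride)]
```

With five logs and `window_size=2` this yields three windows of sizes 2, 2, and 1. If the paper uses a different scheme (for example time-based windows or random sampling), please point me to it.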