Duplicate Results #2

@yosr-nabli

Description

  1. Entity Filters
    The paper identifies date/time tokens as the most common redundant entities. The correct per-dataset filters are:

HDFS — no filter needed. Block IDs (blk_XXXXXX) and IPs are genuine entities and should pass through unchanged.

BGL / TDB — filter the compound timestamp-correlation token (e.g. 1117838570-2005-06-03-15.42.50.675872) that appears in every log line and would otherwise connect all logs together as false positives.

However, even for HDFS I don't get the same number of Entity nodes as reported in the paper. Can you provide the exact filtering used for HDFS, TDB, and BGL?
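To make the question concrete, here is a minimal sketch of the filtering I have tried, assuming the token formats described above; the regex pattern and the `filter_entities` helper are my own guesses, not the paper's implementation:

```python
import re

# Assumed pattern for the BGL/TDB compound timestamp-correlation token,
# e.g. "1117838570-2005-06-03-15.42.50.675872" (epoch-date-time.microseconds).
TIMESTAMP_TOKEN_RE = re.compile(
    r"^\d{10}-\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}$"
)

def filter_entities(tokens, dataset):
    """Drop redundant date/time entity tokens for the given dataset (sketch)."""
    if dataset == "HDFS":
        # No filter: block IDs (blk_XXXXXX) and IPs are genuine entities.
        return list(tokens)
    if dataset in ("BGL", "TDB"):
        return [t for t in tokens if not TIMESTAMP_TOKEN_RE.match(t)]
    raise ValueError(f"unknown dataset: {dataset}")
```

With this filter, `filter_entities(["1117838570-2005-06-03-15.42.50.675872", "node-42"], "BGL")` keeps only `"node-42"`, while HDFS tokens pass through unchanged. Is this the intended behaviour, or are additional token classes filtered?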

  2. Non-Grouped Sampling — Missing Implementation
    This is never implemented in the code. The dataset loaders always attach a structured group ID to each log — block_id for HDFS, node_id for BGL, user_id/component_id for TDB — meaning every sample is inherently grouped by one of those fields.
    Can you provide the implementation for this as well?
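For reference, this is roughly what I would expect "non-grouped" sampling to look like: fixed-size sliding windows over the raw log order, ignoring block_id/node_id/user_id entirely. The window size and stride here are placeholders, not values from the paper:

```python
def non_grouped_samples(logs, window=20, stride=20):
    """Yield fixed-size windows of consecutive log lines,
    ignoring any structured group field (sketch, assumed parameters)."""
    for start in range(0, max(len(logs) - window + 1, 1), stride):
        yield logs[start:start + window]
```

Is this the intended semantics, or does non-grouped sampling mean something else (e.g. random sampling of individual lines)?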
