Merged
Commits
67 commits
94cc438
Remove unncessary dockerfiles
stefanDeveloper Sep 26, 2025
98482ac
Fix linting
stefanDeveloper Sep 26, 2025
c7738bd
Remove fill_levels insertion in collector.py
lamr02n Sep 29, 2025
b2c587b
Remove LogCollector fill level panel from dashboard and edit remainin…
lamr02n Sep 29, 2025
42c2f85
Edit LogCollector fill level panels from dashboard (Overview)
lamr02n Sep 29, 2025
f7b0794
Small updates in Log Volumes dashboard
lamr02n Sep 29, 2025
24086fd
Small updates in Overview dashboard
lamr02n Sep 29, 2025
b558c43
Merge pull request #91 from stefanDeveloper/90-log-volume-collector-f…
stefanDeveloper Sep 29, 2025
66e178c
Update pipeline.rst for Stage 1: Log Storage
lamr02n Sep 30, 2025
feb9e6d
Update docstrings for server.py
lamr02n Sep 30, 2025
9f5cf3f
Add developer guide section to readthedocs for better structure
maldwg Sep 30, 2025
cbff8a6
Fix smal typo
maldwg Sep 30, 2025
56f02d1
Merge pull request #94 from stefanDeveloper/documentation/rc1-develop…
stefanDeveloper Sep 30, 2025
42b5834
Update docstrings for server.py
lamr02n Oct 1, 2025
7d0b48c
Update docstrings for server.py (2)
lamr02n Oct 1, 2025
1c69201
Update docstrings for server.py (3)
lamr02n Oct 1, 2025
149105a
Merge remote-tracking branch 'origin/v1.0.0-rc1' into v1.0.0-rc1
lamr02n Oct 1, 2025
720b36f
Update docstrings for server.py (4)
lamr02n Oct 1, 2025
147cf95
Update docstrings for collector.py
lamr02n Oct 1, 2025
1bf7fe0
Update docstrings for batch_handler.py
lamr02n Oct 2, 2025
b181763
Update pipeline.rst for Stage 2: Log Collection
lamr02n Oct 2, 2025
7741f3f
Update pipeline.rst for Stage 2: Log Collection (2)
lamr02n Oct 2, 2025
4b13a7a
Update pipeline.rst for Stage 2: Log Collection (2)
lamr02n Oct 2, 2025
8c016dd
Merge remote-tracking branch 'origin/v1.0.0-rc1' into v1.0.0-rc1
lamr02n Oct 2, 2025
1db94ad
Update pipeline.rst for Stage 2: Log Collection (3)
lamr02n Oct 2, 2025
a217bf2
Update docker compose
stefanDeveloper Oct 2, 2025
0b98e95
Update docker compose changes
stefanDeveloper Oct 2, 2025
d95b611
Fix linting
stefanDeveloper Oct 2, 2025
8edbc3c
Update banner
stefanDeveloper Oct 2, 2025
21af55d
Update quality
stefanDeveloper Oct 2, 2025
af7383c
Update readthedocs
stefanDeveloper Oct 2, 2025
a5c28d2
Update logline format description in configuration.rst
lamr02n Oct 6, 2025
1abf463
Update pipeline.rst for Stage 3: Log Filtering
lamr02n Oct 6, 2025
f22cd30
Update references and underlines in configuration.rst and pipeline.rst
lamr02n Oct 6, 2025
5941c39
Update docstrings for prefilter.py
lamr02n Oct 6, 2025
ceac86d
Update docstrings for logline_handler.py
lamr02n Oct 6, 2025
5d23967
Update docstrings for clickhouse_kafka_sender.py
lamr02n Oct 6, 2025
fe5f545
Update docstrings for utils.py
lamr02n Oct 6, 2025
df686be
Update docstrings for log_config.py
lamr02n Oct 6, 2025
e9db495
Update docstrings for logline_handler.py
lamr02n Oct 6, 2025
ce68ef9
Update docstrings for kafka_handler.py
lamr02n Oct 6, 2025
89a372b
Update docstrings for inspector.py
lamr02n Oct 6, 2025
174eb0f
Update docstrings for detector.py
lamr02n Oct 6, 2025
c813b29
Update docstrings for clickhouse_batch_sender.py
lamr02n Oct 6, 2025
ff1916f
Update docstrings for monitoring_agent.py
lamr02n Oct 6, 2025
28eb2a5
Handle all sphinx warnings
lamr02n Oct 6, 2025
2783a52
Create global variable to make mock_logs.dev.py more adjustable
lamr02n Oct 8, 2025
2b5de4b
Refactor and update README.md
lamr02n Oct 8, 2025
e304d3f
Update README.md (2)
lamr02n Oct 8, 2025
7e7641f
Update docstrings for inspector.py
lamr02n Oct 13, 2025
052fdce
Update docstrings for detector.py
lamr02n Oct 13, 2025
e0e2df7
Update docstrings for dataset.py
lamr02n Oct 13, 2025
a587b99
Update docstrings for explainer.py
lamr02n Oct 13, 2025
ad4cbc7
Update docstrings for feature.py
lamr02n Oct 13, 2025
a895c55
Update docstrings for model.py
lamr02n Oct 13, 2025
5de1114
Fix argument in model.py
lamr02n Oct 13, 2025
84957b2
Update docstrings for train.py
lamr02n Oct 13, 2025
017b04c
Optimize imports for src/train files
lamr02n Oct 13, 2025
823d925
Small docstring fix
lamr02n Oct 14, 2025
1a0b625
Small docstring fixes (2)
lamr02n Oct 14, 2025
e2cafcb
Fix detector feature calculation
stefanDeveloper Oct 14, 2025
69fb4ad
Fix linting
stefanDeveloper Oct 14, 2025
3f6cc5e
Adapt config.yaml to point at external kafka APIs
maldwg Oct 14, 2025
8b0ae07
Merge branch 'v1.0.0-rc1' of github.com:stefanDeveloper/heiDGAF into …
maldwg Oct 14, 2025
f486aad
Update inspector and detector docu
stefanDeveloper Oct 14, 2025
b96543b
Remove too detailed information in pipeline.rst
lamr02n Oct 17, 2025
12c5387
Update Inspector usage section in pipeline.rst
lamr02n Oct 17, 2025
12 changes: 6 additions & 6 deletions config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -70,12 +70,12 @@ pipeline:

environment:
kafka_brokers:
- hostname: kafka1
port: 19092
- hostname: kafka2
port: 19093
- hostname: kafka3
port: 19094
- hostname: 127.0.0.1
port: 8097
- hostname: 127.0.0.1
port: 8098
- hostname: 127.0.0.1
port: 8099
kafka_topics:
pipeline:
logserver_in: "pipeline-logserver_in"
Expand Down
57 changes: 46 additions & 11 deletions docs/pipeline.rst
Original file line number Diff line number Diff line change
Expand Up @@ -433,17 +433,28 @@ Stage 4: Inspection
Overview
--------

The `Inspector` stage is responsible for running time-series-based anomaly detection on prefiltered batches. This stage
is essential to reduce the load on the `Detection` stage; otherwise, resource complexity increases disproportionately.
The **Inspection** stage performs time-series-based anomaly detection on prefiltered DNS request batches.
Its primary purpose is to reduce the load on the `Detection` stage by filtering out non-suspicious traffic early.

This stage uses StreamAD models—supporting univariate, multivariate, and ensemble techniques—to detect unusual patterns
in request volume and packet sizes.


Main Class
----------

.. py:currentmodule:: src.inspector.inspector
.. autoclass:: Inspector
:members:
:undoc-members:
:show-inheritance:

The :class:`Inspector` class is responsible for:

The :class:`Inspector` is the primary class for running StreamAD models for time-series-based anomaly detection, such as Z-score outlier detection.
In addition, it offers fine-tuning settings for models and anomaly thresholds.
- Loading batches from Kafka
- Extracting time-series features (e.g., frequency and packet size)
- Applying anomaly detection models
- Forwarding suspicious batches to the detector stage
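
The Z-score approach mentioned above can be illustrated with a minimal, self-contained sketch. This is not the actual ``Inspector`` implementation (which uses StreamAD models); the class name, window size, and threshold below are illustrative assumptions showing the general idea of flagging a value that deviates strongly from a sliding window of recent observations.

```python
from collections import deque
import math


class ZScoreDetector:
    """Sliding-window Z-score detector for a univariate stream
    (e.g. DNS requests per time interval)."""

    def __init__(self, window: int = 50, threshold: float = 3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def score(self, x: float) -> float:
        """Z-score of x w.r.t. the current window (0.0 while warming up)."""
        if len(self.values) < 2:
            self.values.append(x)
            return 0.0
        mean = sum(self.values) / len(self.values)
        var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
        std = math.sqrt(var)
        self.values.append(x)
        return 0.0 if std == 0 else (x - mean) / std

    def is_anomalous(self, x: float) -> bool:
        return abs(self.score(x)) > self.threshold


detector = ZScoreDetector(window=10, threshold=3.0)
counts = [12, 11, 13, 12, 11, 12, 13, 11, 12, 300]  # sudden spike at the end
flags = [detector.is_anomalous(c) for c in counts]
# only the final spike is flagged
```

Only batches whose time series trips such a detector need to be forwarded, which is how this stage keeps the expensive ML-based `Detection` stage lightly loaded.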

Usage
-----
Expand Down Expand Up @@ -498,9 +509,9 @@ Stage 5: Detection
Overview
--------

The `Detector` is the heart of heiDGAF. It runs pre-trained machine learning models to obtain a probability outcome for the DNS requests.
The pre-trained models are available online under the EUPL-1.2 license.
In total, we rely on the following data sets for the pre-trained models we offer:
The **Detection** stage is the core of the heiDGAF pipeline. It consumes **suspicious batches** passed from the `Inspector`, applies **pre-trained ML models** to classify individual DNS requests, and issues alerts based on aggregated probabilities.

The pre-trained models used here are licensed under **EUPL‑1.2** and built from the following datasets:

- `CIC-Bell-DNS-2021 <https://www.unb.ca/cic/datasets/dns-2021.html>`_
- `DGTA-BENCH - Domain Generation and Tunneling Algorithms for Benchmark <https://data.mendeley.com/datasets/2wzf9bz7xr/1>`_
Expand All @@ -511,15 +522,39 @@ Main Class

.. py:currentmodule:: src.detector.detector
.. autoclass:: Detector
:members:
:undoc-members:
:show-inheritance:

The :class:`Detector` class:

- Consumes a batch flagged as suspicious.
- Downloads and validates the ML model (if necessary).
- Extracts features from domain names (e.g. character distributions, entropy, label statistics).
- Computes a probability per request and an overall risk score per batch.
- Emits alerts to ClickHouse and logs them to ``/tmp/warnings.json`` where applicable.
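
The entropy feature listed above can be sketched in a few lines; this mirrors the ``calculate_entropy`` helper in ``src/detector/detector.py`` (the function name and example domains here are illustrative):

```python
import math


def shannon_entropy(s: str) -> float:
    """Shannon entropy (bits per character) of a domain label."""
    if not s:
        return 0.0
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)


# DGA-generated random labels tend to have higher entropy than natural words:
shannon_entropy("google")      # ~1.92 bits per character
shannon_entropy("xq3k9fz1vb")  # ~3.32 bits per character (all characters unique)
```

High per-label entropy is one of the signals that, combined with character distributions and label statistics, feeds the classifier.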

Usage
-----

The :class:`Detector` consumes anomalous batches of requests.
It calculates a probability score for each request and, finally, an overall score for the batch.
Alerts are logged to ``/tmp/warnings.json``.
1. The `Detector` listens on the Kafka topic from the Inspector (``inspector_to_detector``).
2. For each suspicious batch:
- Extracts features for every domain request.
- Applies the loaded ML model (after scaling) to compute class probabilities.
- Marks a request as malicious if its probability exceeds the configured `threshold`.
3. Computes an **overall score** (e.g. median of malicious probabilities) for the batch.
4. If malicious requests exist, issues an **alert** record and logs it; otherwise, the batch is filtered.

Alerts are recorded in ClickHouse and also appended to a local JSON file (`warnings.json`) for external monitoring.
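
Steps 2 and 3 can be sketched as follows. The function name and the threshold value are illustrative assumptions; the actual logic lives in ``src/detector/detector.py`` and reads the threshold from the configuration.

```python
import numpy as np

THRESHOLD = 0.5  # illustrative; the real value comes from config.yaml


def score_batch(probabilities: np.ndarray, threshold: float = THRESHOLD):
    """Flag requests whose malicious-class probability exceeds the threshold
    and compute an overall batch score as the median of the flagged ones."""
    malicious = probabilities > threshold
    overall = float(np.median(probabilities[malicious])) if malicious.any() else 0.0
    return malicious, overall


probs = np.array([0.1, 0.2, 0.9, 0.8, 0.95])  # per-request model outputs
flags, overall = score_batch(probs)
# flags → [False, False, True, True, True]; overall → 0.9
```

If any request is flagged, the batch produces an alert record; otherwise it is filtered out, so downstream storage only sees batches with at least one suspected malicious request.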

Configuration
-------------

If you want to load self-trained models, the :class:`Detector` needs a URL path, a model name, and a SHA256 checksum to download and verify the model during start-up.
You may use the provided, pre-trained models or supply your own. To use a custom model, specify:

- `base_url`: URL from which to fetch model artifacts
- `model`: model name
- `checksum`: SHA256 digest for integrity validation
- `threshold`: probability threshold for classifying a request as malicious

These parameters are loaded at startup and used to download, verify, and load the model/scaler if not already cached locally (in temp directory).
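
The download-and-verify flow can be sketched as below. The function name, URL scheme, and cache location are illustrative, not the Detector's actual code; the point is that the artifact is fetched only when missing from the cache and is always checked against the configured SHA256 digest before use.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_model(base_url: str, model: str, checksum: str, cache_dir: str = "/tmp") -> Path:
    """Download a model artifact if not cached, then verify its SHA256 digest."""
    target = Path(cache_dir) / model
    if not target.exists():
        urllib.request.urlretrieve(f"{base_url}/{model}", target)
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    if digest != checksum:
        target.unlink()  # discard the corrupted or tampered download
        raise ValueError(f"checksum mismatch for {model}: got {digest}")
    return target
```

A mismatching checksum aborts start-up rather than silently loading an unverified model, which matters because the model file is executable-adjacent content fetched over the network.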
112 changes: 53 additions & 59 deletions src/detector/detector.py
Original file line number Diff line number Diff line change
Expand Up @@ -244,84 +244,77 @@ def _get_features(self, query: str) -> np.ndarray:
"""

# Splitting by dots to calculate label length and max length
query = query.strip(".")
label_parts = query.split(".")

levels = {
"fqdn": query,
"secondleveldomain": label_parts[-2] if len(label_parts) >= 2 else "",
"thirdleveldomain": (
".".join(label_parts[:-2]) if len(label_parts) > 2 else ""
),
}

label_length = len(label_parts)
label_max = max(len(part) for part in label_parts)
label_average = len(query.strip("."))
parts = query.split(".")
label_max = len(max(parts, key=str)) if parts else 0
label_average = len(query)

basic_features = np.array(
[label_length, label_max, label_average], dtype=np.float64
)

logger.debug("Get letter frequency")
alc = "abcdefghijklmnopqrstuvwxyz"
query_len = len(query)
freq = np.array(
[query.lower().count(i) / len(query) if len(query) > 0 else 0 for i in alc]
[query.lower().count(c) / query_len if query_len > 0 else 0.0 for c in alc],
dtype=np.float64,
)

logger.debug("Get full, alpha, special, and numeric count.")

def calculate_counts(level: str) -> np.ndarray:
if len(level) == 0:
return np.array([0, 0, 0, 0])

full_count = len(level)
alpha_count = sum(c.isalpha() for c in level) / full_count
numeric_count = sum(c.isdigit() for c in level) / full_count
special_count = (
sum(not c.isalnum() and not c.isspace() for c in level) / full_count
if not level:
return np.array([0.0, 0.0, 0.0, 0.0], dtype=np.float64)

full_count = len(level) / len(level)
alpha_ratio = sum(c.isalpha() for c in level) / len(level)
numeric_ratio = sum(c.isdigit() for c in level) / len(level)
special_ratio = sum(
not c.isalnum() and not c.isspace() for c in level
) / len(level)

return np.array(
[full_count, alpha_ratio, numeric_ratio, special_ratio],
dtype=np.float64,
)

return np.array([full_count, alpha_count, numeric_count, special_count])

levels = {
"fqdn": query,
"thirdleveldomain": label_parts[0] if len(label_parts) > 2 else "",
"secondleveldomain": label_parts[1] if len(label_parts) > 1 else "",
}
counts = {
level: calculate_counts(level_value)
for level, level_value in levels.items()
}

logger.debug(
"Get standard deviation, median, variance, and mean for full, alpha, special, and numeric count."
)
stats = {}
for level, count_array in counts.items():
stats[f"{level}_std"] = np.std(count_array)
stats[f"{level}_var"] = np.var(count_array)
stats[f"{level}_median"] = np.median(count_array)
stats[f"{level}_mean"] = np.mean(count_array)
fqdn_counts = calculate_counts(levels["fqdn"])
third_counts = calculate_counts(levels["thirdleveldomain"])
second_counts = calculate_counts(levels["secondleveldomain"])

logger.debug("Start entropy calculation")
level_features = np.hstack([third_counts, second_counts, fqdn_counts])

def calculate_entropy(s: str) -> float:
if len(s) == 0:
return 0
probabilities = [float(s.count(c)) / len(s) for c in dict.fromkeys(list(s))]
entropy = -sum(p * math.log(p, 2) for p in probabilities)
return entropy

entropy = {level: calculate_entropy(value) for level, value in levels.items()}

logger.debug("Finished entropy calculation")
return 0.0
probs = [s.count(c) / len(s) for c in dict.fromkeys(s)]
return -sum(p * math.log(p, 2) for p in probs)

# Final feature aggregation as a NumPy array
basic_features = np.array([label_length, label_max, label_average])

# Flatten counts and stats for each level into arrays
level_features = np.hstack([counts[level] for level in levels.keys()])
logger.debug("Start entropy calculation")
entropy_features = np.array(
[
calculate_entropy(levels["fqdn"]),
calculate_entropy(levels["thirdleveldomain"]),
calculate_entropy(levels["secondleveldomain"]),
],
dtype=np.float64,
)

# Entropy features
entropy_features = np.array([entropy[level] for level in levels.keys()])
logger.debug("Entropy features calculated")

# Concatenate all features into a single numpy array
all_features = np.concatenate(
[
basic_features,
freq,
# freq_features,
level_features,
# stats_features,
entropy_features,
]
[basic_features, freq, level_features, entropy_features]
)

logger.debug("Finished data transformation")
Expand All @@ -338,8 +331,9 @@ def detect(self) -> None: # pragma: no cover
logger.info("Start detecting malicious requests.")
for message in self.messages:
# TODO predict all messages
# TODO use scalar: self.scaler.transform(self._get_features(message["domain_name"]))
y_pred = self.model.predict_proba(
self.scaler.transform(self._get_features(message["domain_name"]))
self._get_features(message["domain_name"])
)
logger.info(f"Prediction: {y_pred}")
if np.argmax(y_pred, axis=1) == 1 and y_pred[0][1] > THRESHOLD:
Expand Down