
Commit 4761761

Merge pull request #89 from stefanDeveloper/v1.0.0-rc1
Draft: Prepare v1.0.0-rc1
2 parents f6462cb + 12c5387 commit 4761761

File tree

64 files changed: +2237 −2071 lines changed


README.md

Lines changed: 124 additions & 101 deletions
@@ -28,8 +28,6 @@
 <a href="https://heidgaf.readthedocs.io/en/latest/"><strong>Explore the docs »</strong></a>
 <br />
 <br />
-<a href="https://mybinder.org/v2/gh/stefanDeveloper/heiDGAF-tutorials/HEAD?labpath=demo_notebook.ipynb">View Demo</a>
-·
 <a href="https://github.com/stefanDeveloper/heiDGAF/issues/new?labels=bug&template=bug-report---.md">Report Bug</a>
 ·
 <a href="https://github.com/stefanDeveloper/heiDGAF/issues/new?labels=enhancement&template=feature-request---.md">Request Feature</a>
@@ -58,23 +56,78 @@

 ## About the Project

-![Pipeline overview](https://raw.githubusercontent.com/stefanDeveloper/heiDGAF/main/docs/media/pipeline_overview.png?raw=true)
+![Pipeline overview](https://raw.githubusercontent.com/stefanDeveloper/heiDGAF/main/docs/media/heidgaf_overview_detailed.drawio.png?raw=true)

 ## Getting Started

-If you want to use heiDGAF, just use the provided Docker compose to quickly bootstrap your environment:
+#### Run **heiDGAF** using Docker Compose:

-```
-docker compose -f docker/docker-compose.yml up
+```sh
+HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml up
 ```
 <p align="center">
 <img src="https://raw.githubusercontent.com/stefanDeveloper/heiDGAF/main/assets/terminal_example.gif?raw=true" alt="Terminal example"/>
 </p>

-## Exemplary Dashboards
-In the summary below you will find exemplary views of the Grafana dashboards. The metrics were obtained using the [mock-generator](./docker/docker-compose.send-real-logs.yml)
+#### Or run the modules locally on your machine:
+```sh
+python -m venv .venv
+source .venv/bin/activate
+
+sh install_requirements.sh
+```
+Alternatively, you can use `pip install` and enter all needed requirements individually with `-r requirements.*.txt`.
+
+Now, you can start each stage, e.g. the inspector:
+
+```sh
+python src/inspector/inspector.py
+```
+
+<p align="right">(<a href="#readme-top">back to top</a>)</p>
+
+
+## Usage
+
+### Configuration
+
+To configure **heiDGAF** according to your needs, use the provided `config.yaml`.
+
+The most relevant settings are related to your specific log line format, the model you want to use, and
+possibly your infrastructure.
+
+The section `pipeline.log_collection.collector.logline_format` has to be adjusted to reflect your specific input log
+line format. Using our flexible log line configuration, you can rename, reorder, and fully configure each
+field of a valid log line. Freely define timestamps, RegEx patterns, lists, and IP addresses. For example, your
+configuration might look as follows:
+
+```yml
+- [ "timestamp", Timestamp, "%Y-%m-%dT%H:%M:%S.%fZ" ]
+- [ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
+- [ "client_ip", IpAddress ]
+- [ "dns_server_ip", IpAddress ]
+- [ "domain_name", RegEx, '^(?=.{1,253}$)((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,63}$' ]
+- [ "record_type", ListItem, [ "A", "AAAA" ] ]
+- [ "response_ip", IpAddress ]
+- [ "size", RegEx, '^\d+b$' ]
+```
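For illustration only, the following Python sketch mimics the semantics of the four field types used above (`Timestamp`, `ListItem`, `IpAddress`, `RegEx`) with nothing but the standard library. The real validation happens inside heiDGAF's log collection stage; the helper below is a hypothetical stand-in, not project code.

```python
import ipaddress
import re
from datetime import datetime

def check_field(kind, value, arg=None):
    """Hypothetical stand-in for the field types configured above."""
    if kind == "Timestamp":              # arg is a strptime format string
        datetime.strptime(value, arg)    # raises ValueError if malformed
    elif kind == "ListItem":             # value must be one of the allowed items
        assert value in arg
    elif kind == "IpAddress":            # accepts IPv4 and IPv6 addresses
        ipaddress.ip_address(value)
    elif kind == "RegEx":                # value must match the given pattern
        assert re.fullmatch(arg, value)
    return True

check_field("Timestamp", "2024-01-01T12:00:00.000Z", "%Y-%m-%dT%H:%M:%S.%fZ")
check_field("ListItem", "NXDOMAIN", ["NOERROR", "NXDOMAIN"])
check_field("IpAddress", "192.168.0.105")
check_field("RegEx", "117b", r"^\d+b$")
```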
+
+The options `pipeline.data_inspection` and `pipeline.data_analysis` are relevant for configuring the model. The section
+`environment` can be fine-tuned to prevent naming collisions for Kafka topics and adjust addressing in your environment.
+
+For more in-depth information on your options, have a look at our
+[official documentation](https://heidgaf.readthedocs.io/en/latest/usage.html), where we provide tables explaining all
+values in detail.
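For a quick look at these sections you can load the shipped `config.yaml`, for example with PyYAML. This snippet is only illustrative: it assumes it is run from the project root, and the key names are taken from the `config.yaml` hunk further down in this commit.

```python
import yaml  # PyYAML, used only for this illustrative snippet

with open("config.yaml") as fh:
    config = yaml.safe_load(fh)

# Top-level sections referenced above; environment.kafka_brokers holds the
# hostname/port pairs visible in the config.yaml changes of this commit.
print(list(config))
print(config.get("environment", {}).get("kafka_brokers"))
```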
+
+### Monitoring
+To monitor the system and observe its real-time behavior, multiple Grafana dashboards have been set up.
+
+Have a look at the following pictures showing examples of how these dashboards might look at runtime.
+
 <details>
-<summary>📊 <strong>Overview Dashboard</strong></summary>
+<summary><strong>Overview</strong> dashboard</summary>
+
+Contains the most relevant information on the system's runtime behavior, its efficiency and its effectiveness.

 <p align="center">
 <a href="./assets/readme_assets/overview.png">
@@ -85,7 +138,10 @@ In the below summary you will find examplary views of the grafana dashboards. Th
 </details>

 <details>
-<summary>📈 <strong>Latencies Dashboard</strong></summary>
+<summary><strong>Latencies</strong> dashboard</summary>
+
+Presents information on latencies, including comparisons between the modules and more detailed,
+stand-alone metrics.

 <p align="center">
 <a href="./assets/readme_assets/latencies.jpeg">
@@ -96,7 +152,11 @@ In the below summary you will find examplary views of the grafana dashboards. Th
 </details>

 <details>
-<summary>📉 <strong>Log Volumes Dashboard</strong></summary>
+<summary><strong>Log Volumes</strong> dashboard</summary>
+
+Presents information on the fill levels of each module, i.e. the number of entries that are currently in the
+module for processing. Includes comparisons between the modules, more detailed, stand-alone metrics, as well as
+total numbers of logs entering the pipeline or being marked as fully processed.

 <p align="center">
 <a href="./assets/readme_assets/log_volumes.jpeg">
@@ -107,7 +167,9 @@ In the below summary you will find examplary views of the grafana dashboards. Th
 </details>

 <details>
-<summary>🚨 <strong>Alerts Dashboard</strong></summary>
+<summary><strong>Alerts</strong> dashboard</summary>
+
+Presents details on the number of logs detected as malicious, including the IP addresses responsible for those alerts.

 <p align="center">
 <a href="./assets/readme_assets/alerts.png">
@@ -118,7 +180,12 @@ In the below summary you will find examplary views of the grafana dashboards. Th
 </details>

 <details>
-<summary>🧪 <strong>Dataset Dashboard</strong></summary>
+<summary><strong>Dataset</strong> dashboard</summary>
+
+This dashboard is only active for the **_datatest_** mode. Users who want to test their own models can use this mode
+for inspecting confusion matrices on testing data.
+
+> This feature is in a very early development stage.

 <p align="center">
 <a href="./assets/readme_assets/datatests.png">
@@ -128,131 +195,87 @@ In the below summary you will find examplary views of the grafana dashboards. Th

 </details>

-
-### Developing
-
-Install all Python requirements:
-
-```sh
-python -m venv .venv
-source .venv/bin/activate
-
-sh install_requirements.sh
-```
-
-Alternatively, you can use `pip install` and enter all needed requirements individually with `-r requirements.*.txt`.
-
-Now, you can start each stage, e.g. the inspector:
-
-```sh
-python src/inspector/main.py
-```
 <p align="right">(<a href="#readme-top">back to top</a>)</p>

-### Configuration
-
-The following table lists the most important configuration parameters with their respective default values.
-The full list of configuration parameters is available at the [documentation](https://heidgaf.readthedocs.io/en/latest/usage.html)
-
-| Path | Description | Default Value |
-| :----------------------------------------- | :-------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------- |
-| `pipeline.data_inspection.inspector.mode` | Mode of operation for the data inspector. | `univariate` (options: `multivariate`, `ensemble`) |
-| `pipeline.data_inspection.inspector.ensemble.model` | Model to use when inspector mode is `ensemble`. | `WeightEnsemble` |
-| `pipeline.data_inspection.inspector.ensemble.module` | Module name for the ensemble model. | `streamad.process` |
-| `pipeline.data_inspection.inspector.models` | List of models to use for data inspection (e.g., anomaly detection). | Array of model definitions (e.g., `{"model": "ZScoreDetector", "module": "streamad.model", "model_args": {"is_global": false}}`)|
-| `pipeline.data_inspection.inspector.anomaly_threshold` | Threshold for classifying an observation as an anomaly. | `0.01` |
-| `pipeline.data_analysis.detector.model` | Model to use for data analysis (e.g., DGA detection). | `rf` (Random Forest) option: `XGBoost` |
-| `pipeline.data_analysis.detector.checksum` | Checksum for the model file to ensure integrity. | `021af76b2385ddbc76f6e3ad10feb0bb081f9cf05cff2e52333e31040bbf36cc` |
-| `pipeline.data_analysis.detector.base_url` | Base URL for downloading the model if not present locally. | `https://heibox.uni-heidelberg.de/d/0d5cbcbe16cd46a58021/` |
-
-<p align="right">(<a href="#readme-top">back to top</a>)</p>

-### Insert test data
+## Models and Training

->[!IMPORTANT]
-> To be able to train and test our or your own models, you will need to download the datasets.
+To train and test our models (and possibly your own), we currently rely on the following datasets:

-For training our models, we currently rely on the following data sets:
 - [CICBellDNS2021](https://www.unb.ca/cic/datasets/dns-2021.html)
 - [DGTA Benchmark](https://data.mendeley.com/datasets/2wzf9bz7xr/1)
 - [DNS Tunneling Queries for Binary Classification](https://data.mendeley.com/datasets/mzn9hvdcxg/1)
 - [UMUDGA - University of Murcia Domain Generation Algorithm Dataset](https://data.mendeley.com/datasets/y8ph45msv8/1)
-- [Real-CyberSecurity-Datasets](https://github.com/gfek/Real-CyberSecurity-Datasets/)
+- [DGArchive](https://dgarchive.caad.fkie.fraunhofer.de/)

-However, we compute all features separately and only rely on the `domain` and `class`.
-Currently, we are only interested in binary classification, thus, the `class` is either `benign` or `malicious`.
+We compute all features separately and only rely on the `domain` and `class` for binary classification.
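To make the `domain`-plus-`class` idea concrete, here is a toy sketch of lexical features that could be derived from a domain string alone. The features heiDGAF actually computes are defined in its training code, so the names and formulas below are purely illustrative.

```python
import math
from collections import Counter

def toy_domain_features(domain: str) -> dict:
    """Illustrative only: simple lexical features derived solely from the domain."""
    counts = Counter(domain)
    entropy = -sum(
        (n / len(domain)) * math.log2(n / len(domain)) for n in counts.values()
    )
    return {
        "length": len(domain),
        "num_labels": len(domain.rstrip(".").split(".")),
        "digit_ratio": sum(ch.isdigit() for ch in domain) / len(domain),
        "char_entropy": entropy,
    }

# `class` would then be the binary label for each domain.
print(toy_domain_features("examp1e-doma1n.com"))
```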

-After downloading the dataset and storing it under `<project-root>/data` you can run
-```
-docker compose -f docker/docker-compose.send-real-logs.yml up
-```
-to start inserting the dataset traffic.
+### Inserting Data for Testing

-<p align="right">(<a href="#readme-top">back to top</a>)</p>
+For testing purposes, we provide multiple scripts in the `scripts` directory. Use `real_logs.dev.py` to send data from
+the datasets into the pipeline. After downloading the dataset and storing it under `<project-root>/data`, run
+```sh
+python scripts/real_logs.dev.py
+```
+to start continuously inserting dataset traffic.
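Note that this commit also adds empty placeholder directories under `data/` (see `data/cic/.gitkeep` and `data/dgta/.gitkeep` below). A quick check that the expected layout exists might look like the following sketch; which dataset belongs in which subdirectory is an assumption, so consult the documentation before relying on it.

```python
from pathlib import Path

# Placeholder directories shipped in this commit; the dataset-to-directory
# mapping is not spelled out here, so treat this check as illustrative only.
for sub in ("cic", "dgta"):
    path = Path("data") / sub
    print(f"{path}: {'found' if path.is_dir() else 'missing'}")
```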

+### Training Your Own Models

-### Train your own models
 > [!IMPORTANT]
 > This is only a brief wrap-up of a custom training process.
 > We highly encourage you to have a look at the [documentation](https://heidgaf.readthedocs.io/en/latest/training.html)
 > for a full description and explanation of the configuration parameters.

-Currently, we feature two trained models, namely XGBoost and RandomForest.
+We feature two trained models:
+1. XGBoost (`src/train/model.py#XGBoostModel`) and
+2. RandomForest (`src/train/model.py#RandomForestModel`).
+
+After installing the requirements, use `src/train/train.py`:

 ```sh
-python -m venv .venv
-source .venv/bin/activate
+> python -m venv .venv
+> source .venv/bin/activate

-pip install -r requirements/requirements.train.txt
-```
+> pip install -r requirements/requirements.train.txt

-After setting up the [dataset directories](#insert-test-data) (and adding the code for your model class if applicable), you can start the training process by running the following commands:
+> python src/train/train.py
+Usage: train.py [OPTIONS] COMMAND [ARGS]...

-**Model Training**
-```
-python src/train/train.py train --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name>
-```
-The results will be saved per default to `./results`, if not configured otherwise. <br>
+Options:
+  -h, --help  Show this message and exit.

-**Model Tests**
-```
-python src/train/train.py test --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name> --model_path <path_to_model_version>
+Commands:
+  explain
+  test
+  train
 ```

-**Model Explain**
-```
-python src/train/train.py explain --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name> --model_path <path_to_model_version>
-```
-This will create a rules.txt file containing the innards of the model, explaining the rules it created.
+Setting up the [dataset directories](#inserting-data-for-testing) (and adding the code for your model class if applicable) lets you start
+the training process by running the following commands:

-<p align="right">(<a href="#readme-top">back to top</a>)</p>
+#### Model Training

+```sh
+> python src/train/train.py train --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name>
+```
+The results are saved to `./results` by default, unless configured otherwise.

-### Data
-
-> [!IMPORTANT]
-> We support custom schemes.
-
-Depending on your data and use case, you can customize the data scheme to fit your needs.
-The below configuration is part of the [main configuration file](./config.yaml) which is detailed in our [documentation](https://heidgaf.readthedocs.io/en/latest/usage.html#id2)
+#### Model Tests

-```yml
-loglines:
-  fields:
-    - [ "timestamp", RegEx, '^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$' ]
-    - [ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
-    - [ "client_ip", IpAddress ]
-    - [ "dns_server_ip", IpAddress ]
-    - [ "domain_name", RegEx, '^(?=.{1,253}$)((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,63}$' ]
-    - [ "record_type", ListItem, [ "A", "AAAA" ] ]
-    - [ "response_ip", IpAddress ]
-    - [ "size", RegEx, '^\d+b$' ]
+```sh
+> python src/train/train.py test --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name> --model_path <path_to_model_version>
 ```

+#### Model Explain

+```sh
+> python src/train/train.py explain --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name> --model_path <path_to_model_version>
+```
+This will create a `rules.txt` file containing the internals of the model, explaining the rules it created.

 <p align="right">(<a href="#readme-top">back to top</a>)</p>

+
 <!-- CONTRIBUTING -->
 ## Contributing

assets/heidgaf_logo_github.png

51.3 KB

config.yaml

Lines changed: 3 additions & 3 deletions
@@ -70,11 +70,11 @@ pipeline:

 environment:
   kafka_brokers:
-    - hostname: kafka1
+    - hostname: 127.0.0.1
       port: 8097
-    - hostname: kafka2
+    - hostname: 127.0.0.1
       port: 8098
-    - hostname: kafka3
+    - hostname: 127.0.0.1
       port: 8099
   kafka_topics:
     pipeline:

data/.gitkeep

Whitespace-only changes.

data/cic/.gitkeep

Whitespace-only changes.

data/cic/cic_dns_decode.py

Lines changed: 0 additions & 26 deletions
This file was deleted.

data/dgta/.gitkeep

Whitespace-only changes.

data/dgta/dgta_decode.py

Lines changed: 0 additions & 17 deletions
This file was deleted.

docker/benchmark_tests/Dockerfile.run_test

Lines changed: 0 additions & 17 deletions
This file was deleted.
