This is the supporting page for the research titled, "Taming Volatility: Stable and Private QUIC Classification with Federated Learning."
The work presents a methodology for processing and classifying QUIC network traffic flows with Federated Learning under a real-world non-IID dataset.
In line with principles of reproducibility and transparency, and to support open science, the scripts utilized in the experiments are available to the public. The dataset generated for this research is also accessible. These resources aim to aid a more comprehensive understanding of the methodologies employed and foster additional research in this field.
The project uses the real-world CESNET-QUIC22 dataset. To use it, download the zip file, unzip it, rename the root folder to 0-cesnet-quic22, and place it at ~/datasets/0-cesnet-quic22. The folder should look like this:
Reproducing the dataset requires the prefixes-orgs.csv file, which maps each flow's IP address to an organization ID. For privacy reasons, this file is not public; please contact the authors if you want to reproduce the dataset. Without the mapping, the methodology assigns a random ID to each flow record.
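The fallback described above can be pictured with a minimal sketch. It is not the code from 1-prepare-dataset.ipynb: the function name `assign_org_ids`, the in-memory `prefix_to_org` dict (a hypothetical stand-in for the contents of prefixes-orgs.csv, whose real column layout is not documented here), and the `num_random_orgs` and `seed` parameters are all assumptions made for illustration.

```python
import ipaddress
import random


def assign_org_ids(flow_ips, prefix_to_org=None, num_random_orgs=10, seed=0):
    """Return one organization ID per flow IP address.

    prefix_to_org is a hypothetical dict of CIDR prefix -> organization ID.
    When it is None (the private mapping is unavailable), every flow gets a
    random ID instead, mirroring the fallback described in the text.
    """
    rng = random.Random(seed)  # seeded so the random assignment is repeatable
    if prefix_to_org is None:
        return [rng.randrange(num_random_orgs) for _ in flow_ips]

    # Parse the prefixes once and try the most specific prefix first.
    networks = [(ipaddress.ip_network(p), org) for p, org in prefix_to_org.items()]
    networks.sort(key=lambda item: item[0].prefixlen, reverse=True)

    ids = []
    for ip in flow_ips:
        addr = ipaddress.ip_address(ip)
        org = next((o for net, o in networks if addr in net), None)
        if org is None:
            # Assumption: unmatched addresses also fall back to a random ID.
            org = rng.randrange(num_random_orgs)
        ids.append(org)
    return ids
```

Calling `assign_org_ids(ips)` without a mapping yields random IDs, so results obtained this way will presumably differ from those based on the real organization split.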
configuration.py holds the static key values used across multiple notebooks and files. It is important to set the self._path_home variable to the exact path of this project's root folder (FL-QUIC-TC), because every save, load, read, and write operation is handled relative to this path.
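To illustrate why the root path matters, here is a minimal sketch of a configuration class that resolves every location relative to `_path_home`. Only the `_path_home` attribute name comes from this README; the constructor argument and the helper methods `dataset_dir` and `results_dir` are hypothetical, and the real Configuration class may be organized differently.

```python
from pathlib import Path


class Configuration:
    """Hypothetical minimal stand-in for the class in configuration.py."""

    def __init__(self, path_home):
        # Root folder of the project (FL-QUIC-TC); all other paths hang off it.
        self._path_home = Path(path_home).expanduser()

    def dataset_dir(self, stage):
        # e.g. stage = "4-dataset" -> <root>/datasets/4-dataset
        return self._path_home / "datasets" / stage

    def results_dir(self, name):
        # e.g. name = "global_models" -> <root>/results/global_models
        return self._path_home / "results" / name


cfg = Configuration("/home/user/FL-QUIC-TC")  # example path, adjust to your setup
print(cfg.dataset_dir("4-dataset"))
```

With this kind of layout, a wrong root path shifts every derived read and write location at once, which is why the variable should be set before running any notebook.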
In our repository, the files and notebooks are organized as follows:
- datasets: Folder for the
- 0-cesnet-quic22: Home folder for the dataset. Please download it, unzip it and put it here while renaming its root folder to 0-cesnet-quic22.
- 1-filtered: Daily datasets containing only the target application labels and organization IDs.
- 2-eval: Expand and save the PPI and histogram features into multiple columns.
- 3-features: Calculate additional features: SUBPSTATS, SUBFLOWSTATS.
- 4-dataset: The time-sorted datasets: the full dataset (dataset.parquet), the per-client datasets (org-{client_id}.parquet), and the chunked CL dataset under CL/.
- 5-federated: Federated data chunks for each scenario.
- results
- class_labeling: Holds the class-to-ID mappings that result from label encoding.
- global_models: Holds the global model aggregated after each FL round, for each FL scenario.
- scenarios: Holds the results for each scenario.
- CL: Holds the results for each central learning scenario.
- 1-prepare-dataset.ipynb: This notebook creates the CL data and the FL scenario data chunks from the CESNET-QUIC22 dataset in 5 steps. Step 1 requires prefixes-orgs.csv.
- 2-dataset-visualization.ipynb: Visualizes the important features of the created dataset, such as traffic distribution between clients, apps and clients+apps. Uses data from /datasets/4-dataset.
- 3-central-learning.ipynb: In this notebook, one can execute the CL pipeline with selected features. The scenarios and their features used in the work are already written in the code as a dict (CL_CASE_FEATURES). As a result, it executes a CL and then creates and saves its results.
- 4-shap.ipynb: In this notebook, one can choose a CL scenario and analyse it with SHAP. As a result, it creates a SHAP diagram of the 20 most influential features.
- 5-federated-learning.ipynb: Executes an FL scenario on a chosen FL dataset with the chosen aggregation algorithms. While the FL runs, it exports the results after every round.
- 6-federated-visualizations.ipynb: Visualizes the overall performance comparison of the executed aggregation algorithms, as well as the grouped client F1-scores used for the research.
- configuration.py: Describes the Configuration utility class, which holds key static parameters. The most important one is the _path_home variable; don't forget to set it to the full path of your "FL-QUIC-TC" folder. The results reported in the work were achieved with the given values.
- federated_clients.py: Describes the FederatedClient class, which is used as a client instance in Flower federated learning.
- federated_metrics.py: Describes the MetricsTracker class and other helpful functions for metric collection, calculation and visualization.
- model.py: Describes the fully-connected neural network architecture of the PyTorch model used.
- prefixes-orgs.csv: IP prefix to organization mapping. Not included because it is not public. Please contact the authors if you want to reproduce the dataset.
- .gitignore: Ignore huge files and the private mapping file in git.
- requirements.txt: The exact Python libraries and versions originally used. Exported from the project's Python virtual environment via "pip freeze > requirements.txt".
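
The 4-dataset stage listed above (time-sorted data, one file per client, plus chunks) can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the record keys `time` and `org_id`, the function names, and the near-equal chunking rule are hypothetical, and plain dicts stand in for the parquet tables. Only the `org-{client_id}.parquet` naming pattern is taken from this README.

```python
from collections import defaultdict


def split_per_client(records, time_key="time", client_key="org_id"):
    """Sort records by time, then group them under per-client file names."""
    per_client = defaultdict(list)
    for rec in sorted(records, key=lambda r: r[time_key]):
        per_client[rec[client_key]].append(rec)
    # Key each group by the file name pattern used in 4-dataset.
    return {f"org-{cid}.parquet": recs for cid, recs in per_client.items()}


def chunkify(records, n_chunks):
    """Cut an already time-sorted list into n_chunks consecutive pieces.

    Assumption: chunks are near-equal in size, with any remainder spread
    over the first chunks. The actual chunking rule may differ.
    """
    size, rem = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rem else 0)
        chunks.append(records[start:end])
        start = end
    return chunks
```

Keeping the chunks consecutive preserves the temporal order inside each client's data, which matters when later rounds are meant to see later traffic.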
