Supporting repository for the paper "Beyond JA4+: Flow Statistics vs. TLS Fingerprinting for Encrypted Malware Detection" by Márton Pál Lipcsey-Magyar, Attila Ármin Madarász, and Adrian Pekar.
The deployment of Encrypted Client Hello (ECH) challenges TLS fingerprinting, a widely used approach for encrypted malware detection, by encrypting the handshake fields these methods rely on. This paper presents a systematic evaluation of flow-based statistical features as a handshake-independent alternative to fingerprinting. Through validation against the official JA4+ implementation, we establish limitations in fingerprinting approaches for this corpus: only 64.9% of malware families possess unique signatures, placing an inherent ceiling on achievable recall in our evaluation. We evaluate flow-level features—packet counts, timing patterns, and size distributions—across 27 experimental configurations on a dataset of 16,542 flows spanning 101 families (59 malware and 42 benign applications). Random Forest classifiers using combined flow statistics and sequential packet length features achieve 98.11% F1-score for binary malware detection with 97.22% recall, substantially exceeding fingerprinting’s theoretical recall bound of 64.9%. For fine-grained family identification, we obtain 54.81% macro F1 across 101 classes and 48.71% macro F1 for malware-only attribution, demonstrating that flow-based methods retain meaningful discriminative power where fingerprinting abstains. Across all tasks, Random Forest consistently outperforms neural networks and k-NN, with performance gaps widening in complex multiclass scenarios. These findings highlight flow-based classification as a practical and reproducible approach that can help maintain network security visibility as ECH deployment progresses, showing that behavioral traffic patterns are expected to provide durable signals for detection even as handshake fields become encrypted.
├── reproduce-research/ # Validation pipelines
│ ├── paper-pipeline/ # Reproduce using the original dataset authors' data
│ ├── nfstream-pipeline/ # Reproduce using NFStream extraction
│ └── verify-ja4-calculation/ # JA4+ conformance validation
│
└── paper-code/ # Main classification system (Python)
See paper-code/README.md for detailed usage instructions.

Binary malware detection (malware vs. benign):

| Model | Feature Set | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Random Forest | Combined | 97.07% | 99.02% | 97.22% | 98.11% |
| Random Forest | Core | 96.55% | 98.66% | 96.91% | 97.78% |
| Random Forest | SPL | 96.65% | 98.74% | 96.95% | 97.84% |
| Neural Network | Combined | 90.03% | 94.39% | 92.78% | 93.58% |
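
As a quick sanity check on the binary results, the F1-Score column can be recomputed from the Precision and Recall columns, since F1 is their harmonic mean. The sketch below is only an illustration and not part of the repository code; the `f1_score` helper and the `reported` dictionary are hypothetical names, and the numbers are copied from the table above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, feature set) -> (precision %, recall %, reported F1 %), copied from the table.
reported = {
    ("Random Forest", "Combined"): (99.02, 97.22, 98.11),
    ("Random Forest", "Core"): (98.66, 96.91, 97.78),
    ("Random Forest", "SPL"): (98.74, 96.95, 97.84),
    ("Neural Network", "Combined"): (94.39, 92.78, 93.58),
}

for (model, features), (p, r, f1) in reported.items():
    recomputed = 100 * f1_score(p / 100, r / 100)
    print(f"{model:15s} {features:9s} reported={f1:.2f} recomputed={recomputed:.2f}")
```

Each recomputed value should agree with the reported one up to rounding of the published percentages.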

Family classification, full multiclass (101 classes):

| Model | Feature Set | Accuracy | Macro F1 |
|---|---|---|---|
| Random Forest | Combined | 61.62% | 54.81% |
| Random Forest | Core | 59.66% | 52.39% |
| FAISS k-NN | Combined | 43.97% | 34.30% |

TLS fingerprinting vs. flow-based ML (binary malware detection):

| Metric | TLS Fingerprinting (JA4+JA4S+SNI) | Flow-Based ML (RF+Combined) |
|---|---|---|
| Recall | ≤64.9% (theoretical max) | 97.22% |
| F1-Score | ≤78.6% | 98.11% |
| Handshake-Independent | No | Yes |
| Malware Coverage | 64.9% | 100% |
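
One way to see why a recall ceiling also caps F1: for a fixed recall, F1 grows with precision, so the best case is perfect precision. The sketch below computes that best-case bound. It is an illustration under the assumption of perfect precision, not necessarily the calculation used in the paper, and `f1_upper_bound` is a hypothetical helper name.

```python
def f1_upper_bound(max_recall: float) -> float:
    """Largest F1 reachable when recall is capped at max_recall.

    Assumes perfect precision (1.0), which maximizes F1 for any fixed recall:
    F1 = 2 * P * R / (P + R) with P = 1 gives 2 * R / (1 + R).
    """
    return 2 * max_recall / (1 + max_recall)


bound = f1_upper_bound(0.649)  # recall ceiling taken from the table above
print(f"F1 upper bound at 64.9% recall: {100 * bound:.1f}%")
```

With the rounded 64.9% ceiling this evaluates to roughly 78.7%, in line with the ≤78.6% listed in the table; the small gap presumably comes from rounding or from an unrounded recall value used in the paper.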
Open paper-code/notebooks/experiments.ipynb for an interactive notebook with all experiments, visualizations, and analysis.
The experiments use the malware traffic dataset from:
Matoušek, P., Přívora, J., & Ryšavý, O. (2024). "TLS Traffic Analysis: Malware Classification with JA4+ Fingerprints"
Dataset characteristics:
- 16,542 flows across 101 families (59 malware, 42 benign)
- Sources: Desktop malware, mobile malware, desktop apps, mobile apps
- Authenticated and labeled network traces
Note: Processed CSV files with extracted features are available under reproduce-research/. For the original PCAPs, refer to Matoušek et al.
Core flow statistics (33 features):
- Volumetric: Packet counts, byte volumes (bidirectional, src→dst, dst→src)
- Temporal: Flow duration per direction
- Statistical: Packet size distributions (min, mean, stddev, max)
- Timing: Packet inter-arrival time (PIAT) distributions

Sequential packet lengths, SPL (25 features):
- First 25 packet sizes in arrival order
- Captures protocol-specific patterns
- Early detection capability

Combined (58 features):
- Synergy between macro-level (flow stats) and micro-level (SPL) patterns
- Best performance across all tasks
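
The feature groups above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's extraction code: the `Packet` class, the function names, the feature ordering, the use of population standard deviation, and zero-padding of flows shorter than 25 packets are all assumptions. Packets are assumed to be sorted by timestamp.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

SPL_LENGTH = 25  # number of leading packet sizes kept, per the list above


@dataclass
class Packet:
    timestamp: float  # seconds since an arbitrary origin
    size: int         # bytes
    src_to_dst: bool  # True for src→dst, False for dst→src


def summary(values):
    """Return [min, mean, stddev, max]; all zeros for an empty sequence."""
    if not values:
        return [0.0, 0.0, 0.0, 0.0]
    return [float(min(values)), float(mean(values)),
            float(pstdev(values)), float(max(values))]


def core_features(packets):
    """Counts, bytes, duration, size stats and PIAT stats per direction group."""
    fwd = [p for p in packets if p.src_to_dst]
    bwd = [p for p in packets if not p.src_to_dst]
    feats = []
    for group in (packets, fwd, bwd):  # bidirectional, src→dst, dst→src
        sizes = [p.size for p in group]
        times = [p.timestamp for p in group]
        piats = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
        duration = times[-1] - times[0] if times else 0.0
        feats += [float(len(group)), float(sum(sizes)), duration]
        feats += summary(sizes)
        feats += summary(piats)
    return feats


def spl_features(packets):
    """First SPL_LENGTH packet sizes in arrival order, zero-padded (assumption)."""
    sizes = [float(p.size) for p in packets[:SPL_LENGTH]]
    return sizes + [0.0] * (SPL_LENGTH - len(sizes))


def combined_features(packets):
    """Concatenate flow statistics and the packet-length sequence."""
    return core_features(packets) + spl_features(packets)


# A made-up four-packet flow for illustration.
flow = [Packet(0.00, 517, True), Packet(0.03, 1460, False),
        Packet(0.05, 80, True), Packet(0.20, 310, False)]
vector = combined_features(flow)
print(len(core_features(flow)), len(spl_features(flow)), len(vector))
```

With three direction groups and eleven values per group, this sketch happens to produce 33 flow statistics plus 25 packet lengths, matching the feature counts stated in this README; the exact columns used in the paper may still differ.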
- 3 Classification Tasks: Binary, Full Multiclass (101 classes), Malware-only (59 classes)
- 3 Feature Sets: Core (33 features), SPL (25 features), Combined (58 features)
- 3 ML Models: Neural Network, Random Forest, FAISS k-NN
- Total: 27 experimental configurations
- Reproducibility: Fixed random seed (42), stratified 80/20 train/test splits
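
The experimental grid and the split strategy can be illustrated with the standard library alone. The sketch below is not the repository's code: the configuration labels and the `stratified_split` helper are hypothetical, and the actual pipeline may rely on a library routine for splitting. It only shows how 3 tasks × 3 feature sets × 3 models yield 27 configurations and what a seeded, stratified 80/20 split does.

```python
import random
from collections import defaultdict
from itertools import product

# Hypothetical labels for the three axes listed above.
TASKS = ["binary", "multiclass_101", "malware_only_59"]
FEATURE_SETS = ["core", "spl", "combined"]
MODELS = ["neural_network", "random_forest", "faiss_knn"]
SEED = 42

# Every (task, feature set, model) combination: 3 * 3 * 3 = 27.
configurations = list(product(TASKS, FEATURE_SETS, MODELS))


def stratified_split(labels, test_fraction=0.2, seed=SEED):
    """Return (train_idx, test_idx) with each label represented proportionally."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test += indices[:n_test]
        train += indices[n_test:]
    return sorted(train), sorted(test)


# Toy labels for illustration only.
labels = ["malware"] * 50 + ["benign"] * 50
train_idx, test_idx = stratified_split(labels)
print(len(configurations), len(train_idx), len(test_idx))
```

Because the split is driven by a seeded generator, calling `stratified_split` again with the same labels returns the same indices, which is the property the fixed seed is meant to provide.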
- Márton Pál Lipcsey-Magyar - Budapest University of Technology and Economics
- Attila Ármin Madarász - Budapest University of Technology and Economics
- Adrian Pekar - Budapest University of Technology and Economics & CUJO LLC
For questions about the paper or code:
- Adrian Pekar: [email protected]
Supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences and Celtic-Next project RAI-6Green: Robust and AI Native 6G for Green Networks (C2023/1-9, funded by 2024-1.2.6-EUREKA-2024-00009).
Note: This repository contains the complete implementation and validation pipelines supporting the paper. All experimental results are reproducible using the provided code and methodology.