Releases: vulnerability-lookup/VulnTrain
Release 2.2.0
Training
- New CLI options for the severity classification trainer (`classify_severity.py`):
  - `--no-codecarbon`: Disable CodeCarbon emissions tracking.
  - `--no-push`: Disable pushing the model and tokenizer to Hugging Face Hub.
  - `--no-cache`: Disable cache for the model during training.
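As an illustration of how such opt-out flags behave, here is a minimal sketch using `argparse`. Only the three flag names and their descriptions come from the release notes; the `build_parser` function and the way the flags are wired are assumptions, not the actual code of `classify_severity.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the flag names and help texts are taken
    # from the release notes, everything else is illustrative.
    parser = argparse.ArgumentParser(
        description="Severity classification trainer (sketch)"
    )
    parser.add_argument(
        "--no-codecarbon",
        action="store_true",
        help="Disable CodeCarbon emissions tracking.",
    )
    parser.add_argument(
        "--no-push",
        action="store_true",
        help="Disable pushing the model and tokenizer to Hugging Face Hub.",
    )
    parser.add_argument(
        "--no-cache",
        action="store_true",
        help="Disable cache for the model during training.",
    )
    return parser


# Each flag defaults to False, so every feature stays enabled unless disabled.
args = build_parser().parse_args(["--no-push", "--no-cache"])
print(args.no_codecarbon, args.no_push, args.no_cache)  # False True True
```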
Release 2.1.0
What's New
Datasets
- CWE/Patch dataset improvements: More fields are now considered when searching for vulnerability patches. Asynchronous requests to GitHub are now less aggressive.
- CWE Guesser dataset:
  - Now uses the new vulnerability endpoint of Vulnerability-Lookup.
  - References in security advisories without the `patch` tag are also considered.
  - Repo ID is now a configurable parameter in the dataset generation script.
- URL handling improvements:
  - `normalize_patch_url` function improved for better patch URL processing.
  - URLs with fragments are now properly handled.
- Concurrency: Reduced the number of default concurrent requests to 12 to avoid overloading external services.
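The fragment handling and the concurrency cap described above can be sketched as follows. This is not the project's `normalize_patch_url` implementation: `strip_fragment`, `fetch`, and `fetch_all` are hypothetical helpers, the HTTP call is replaced by a short sleep, and the example URLs are made up. Only the limit of 12 concurrent requests comes from the release notes.

```python
import asyncio
from urllib.parse import urldefrag

MAX_CONCURRENT_REQUESTS = 12  # default stated in the release notes


def strip_fragment(url: str) -> str:
    """Drop the #fragment part of a URL (illustrative helper only)."""
    return urldefrag(url).url


async def fetch(url: str, semaphore: asyncio.Semaphore) -> str:
    # At most MAX_CONCURRENT_REQUESTS coroutines are inside this block at once.
    async with semaphore:
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        return strip_fragment(url)


async def fetch_all(urls: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    return await asyncio.gather(*(fetch(u, semaphore) for u in urls))


# Made-up commit URLs carrying a fragment that points into the diff view.
urls = [
    f"https://github.com/example/repo/commit/{i:040x}#diff-{i}" for i in range(30)
]
results = asyncio.run(fetch_all(urls))
print(len(results), results[0])
```

A semaphore keeps all tasks scheduled while bounding how many are in flight, which is one common way to stay gentle on external services.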
Dependencies
- Updated Python dependencies, including a PyTorch bump from 2.7.1 to 2.8.0.
- General dependency updates across the project.
Miscellaneous
- Minor code improvements and style updates (reformatted with `black`).
Release 2.0.0
News
- Dataset generation: Introduced a new script to build datasets of structured vulnerabilities enriched with CWE identifiers and corresponding patches. Each entry now includes the Git commit message and the full diff (Base64-encoded). #10 by @3LS3-1F
- Model generation: Added a new trainer for predicting CWE classifications from vulnerability descriptions and associated patches (commit messages). #10 by @3LS3-1F
Related resources shared via Hugging Face: https://huggingface.co/collections/CIRCL/vlai-for-cwe-guessing-68bab22e3d71b513146d13b3
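Since each dataset entry stores the full diff Base64-encoded, a consumer has to decode it before use. The sketch below shows one way to do that. The field names `commit_message` and `patch_text_b64` and the sample entry are assumptions made for illustration; the actual dataset schema may differ.

```python
import base64

# Hypothetical entry: field names and content are illustrative assumptions.
diff_text = (
    "--- a/app.py\n"
    "+++ b/app.py\n"
    "@@ -1 +1 @@\n"
    "-eval(user_input)\n"
    "+ast.literal_eval(user_input)\n"
)
entry = {
    "commit_message": "Fix code injection in input parsing",
    "patch_text_b64": base64.b64encode(diff_text.encode("utf-8")).decode("ascii"),
}


def decode_patch(entry: dict) -> str:
    """Return the unified diff stored Base64-encoded in the entry."""
    return base64.b64decode(entry["patch_text_b64"]).decode("utf-8")


decoded = decode_patch(entry)
# Keep only added lines, skipping the "+++" file header of the diff.
added = [
    line[1:]
    for line in decoded.splitlines()
    if line.startswith("+") and not line.startswith("+++")
]
print(added)  # ['ast.literal_eval(user_input)']
```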
Changes
- Improved documentation and reorganized modules for better clarity and maintainability.
- Updated dependencies to their latest stable versions.
Release 1.5.0
Release 1.4.0
This version adds support for creating new AI-ready datasets based on the China National Vulnerability Database (CNVD). It also introduces a new training module designed to classify vulnerabilities using text classification models tailored for CNVD data. By default, hfl/chinese-macbert-base is used, but hfl/chinese-bert-wwm-ext or google-bert/bert-base-chinese can be used instead.
By @3LS3-1F
Release 1.3.1
Updated dependencies and fixed issues due to changes in transformers.
Release 1.3.0
Changes
- Updated dependencies.
Release 1.2.0
Changes
- Dataset generation: CVSS data is now extracted from GitHub and PySec security advisories.
- Dataset generation: CVSS, CPE, title, and description (summary) are now extracted from CSAF documents.
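To make the CSAF extraction more concrete, here is a minimal sketch that pulls the same four kinds of fields out of a CSAF 2.0 style document. It is not the project's extraction code: `iter_cpes`, `extract_fields`, and the sample advisory (including its placeholder CVE identifier) are hypothetical, and only a CVSS v3 score is handled.

```python
from typing import Any, Iterator


def iter_cpes(node: Any) -> Iterator[str]:
    """Recursively yield every 'cpe' string found under a CSAF product_tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "cpe" and isinstance(value, str):
                yield value
            else:
                yield from iter_cpes(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_cpes(item)


def extract_fields(csaf: dict) -> dict:
    document = csaf.get("document", {})
    # Use the first note whose category marks it as a summary or description.
    description = next(
        (
            note.get("text")
            for note in document.get("notes", [])
            if note.get("category") in ("summary", "description")
        ),
        None,
    )
    scores = [
        score["cvss_v3"]["baseScore"]
        for vuln in csaf.get("vulnerabilities", [])
        for score in vuln.get("scores", [])
        if "cvss_v3" in score
    ]
    return {
        "title": document.get("title"),
        "description": description,
        "cvss_v3_scores": scores,
        "cpes": list(iter_cpes(csaf.get("product_tree", {}))),
    }


# Made-up advisory for illustration only.
sample = {
    "document": {
        "title": "Example advisory",
        "notes": [
            {"category": "legal_disclaimer", "text": "Provided as is."},
            {"category": "summary", "text": "Buffer overflow in the example parser."},
        ],
    },
    "product_tree": {
        "branches": [
            {
                "category": "product_name",
                "name": "Example Parser",
                "product": {
                    "name": "Example Parser 1.0",
                    "product_id": "P1",
                    "product_identification_helper": {
                        "cpe": "cpe:2.3:a:example:parser:1.0:*:*:*:*:*:*:*"
                    },
                },
            }
        ]
    },
    "vulnerabilities": [
        {
            "cve": "CVE-0000-0000",  # placeholder identifier
            "scores": [
                {
                    "products": ["P1"],
                    "cvss_v3": {
                        "version": "3.1",
                        "vectorString": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H",
                        "baseScore": 9.8,
                        "baseSeverity": "CRITICAL",
                    },
                }
            ],
        }
    ],
}

fields = extract_fields(sample)
print(fields["title"], fields["cvss_v3_scores"], fields["cpes"])
```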
Release 1.1.0
News
- Trainers: Support for roberta-base in the text classifier, with improved settings for TrainingArguments.
- Validators: Validator for severity classification.
Release 1.0.0
News
- Introduced a new trainer to automatically classify vulnerabilities based on their descriptions, even when CVSS scores are unavailable.
- Added CVSS parsing to the dataset generation script.
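As a rough illustration of CVSS parsing and of how a score could become a training label for a severity classifier, consider the sketch below. The helpers `parse_cvss_vector` and `severity_label` are hypothetical and not taken from the project. The thresholds follow the CVSS v3.x qualitative severity rating scale; whether the trainer uses exactly these labels is an assumption.

```python
def parse_cvss_vector(vector: str) -> dict[str, str]:
    """Split a CVSS v3.x vector string into its version and metric values."""
    prefix, *metrics = vector.split("/")
    parsed = {"version": prefix.split(":", 1)[1]}
    for metric in metrics:
        key, value = metric.split(":", 1)
        parsed[key] = value
    return parsed


def severity_label(base_score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if base_score == 0.0:
        return "None"
    if base_score < 4.0:
        return "Low"
    if base_score < 7.0:
        return "Medium"
    if base_score < 9.0:
        return "High"
    return "Critical"


metrics = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(metrics["version"], metrics["AV"], severity_label(9.8))  # 3.1 N Critical
```

Labels derived this way from scored entries could serve as training targets, so that descriptions without a CVSS score can still be classified at inference time.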
Changes
- Refactored the project structure for better organization.
- Improved CPE parsing.
- Enhanced the dataset generation script.
- Optimized the trainer for text generation on vulnerability descriptions.
- Improved command-line argument parsing.
- Improved the process of pushing the tokenizer and trainer to Hugging Face.