Commit a876c1d

init commit
0 parents  commit a876c1d

9 files changed: 5337 additions, 0 deletions

.gitignore

Lines changed: 210 additions & 0 deletions
@@ -0,0 +1,210 @@
# File created using '.gitignore Generator' for Visual Studio Code: https://bit.ly/vscode-gig
# Created by https://www.toptal.com/developers/gitignore/api/macos,python
# Edit at https://www.toptal.com/developers/gitignore?templates=macos,python

### macOS ###
# General
.DS_Store
.AppleDouble
.LSOverride

# Icon must end with two \r
Icon


# Thumbnails
._*

# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

### macOS Patch ###
# iCloud generated files
*.icloud

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

### Python Patch ###
# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
poetry.toml

# ruff
.ruff_cache/

# End of https://www.toptal.com/developers/gitignore/api/macos,python

# Custom rules (everything added below won't be overridden by 'Generate .gitignore File' if you use 'Update' option)

README.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# Tool Assisted Manual Data Set on Vulnerability Introducing Commits

Research on identifying vulnerabilities and the commits that introduce them is ongoing. However, many current methods rely heavily on automation, which can lead to a high rate of false positives and requires significant error checking. To address this issue, we developed a tool-assisted pipeline for manually reviewing vulnerabilities and their corresponding commits. Additionally, we collected relevant metadata, such as the modified lines of code and the mapping of CVEs to CWE categories. This dataset can be used to validate automated methods such as machine learning approaches.

## Table of Contents

- [Tool Assisted Manual Data Set on Vulnerability Introducing Commits](#tool-assisted-manual-data-set-on-vulnerability-introducing-commits)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [JSON Fields](#json-fields)
    - [Example](#example)
  - [Review Pipeline Instructions](#review-pipeline-instructions)
    - [Prerequisites](#prerequisites)
    - [Setup](#setup)
    - [Usage](#usage)
    - [Input Dataset](#input-dataset)

[![DOI](https://zenodo.org/badge/587266310.svg)](https://zenodo.org/badge/latestdoi/587266310)

## Dataset Description

The complete dataset can be found [here](/dataset/).

It is structured as a JSON file with the following fields:

### JSON Fields
| Fieldname | Brief |
| --- | --- |
| cwe | Common Weakness Enumeration ID |
| introducing | Commit hash that introduces the vulnerability |
| intro_stats | Number of lines added/deleted in the introducing commit |
| intro_lines | Lines marked as vulnerable in the introducing commit |
| fixing_stats | Number of lines added/deleted in the fixing commits |
| fixing_lines | Lines marked as fixing the vulnerability in the fixing commit |
| days_between | Days between the identified introducing and fixing commits |

### Example
```json
{
    "cve": "CVE-2019-11274",
    "cwe": "CWE-79",
    "repository": "https://github.com/cloudfoundry/uaa",
    "fixing": [
        "a34f55fc97a81966faf21e3ae404ec24f1f31cf7"
    ],
    "introducing": "bb8ff8f4e8969b46fdacffcd27781d223c8c7244",
    "intro_stats": {
        "bb8ff8f4e8969b46fdacffcd27781d223c8c7244": {
            "add": 320,
            "del": 7
        }
    },
    "fixing_stats": {
        "a34f55fc97a81966faf21e3ae404ec24f1f31cf7": {
            "add": 68,
            "del": 17
        }
    },
    "days_between": 1836,
    "fixing_lines": {
        "server/src/main/java/org/cloudfoundry/identity/uaa/scim/endpoints/ScimGroupEndpoints.java": "168"
    },
    "introducing_lines": {
        "scim/src/main/java/org/cloudfoundry/identity/uaa/scim/endpoints/ScimGroupEndpoints.java": "190"
    }
}
```
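
As a minimal sketch of how a record like the example could be consumed, the following Python snippet parses one entry and condenses it into a summary. The helpers `total_churn` and `summarize` are hypothetical names and not part of the tool. The record is an abbreviated copy of the example, inlined instead of being read from the dataset directory, and the snippet assumes that `intro_stats` and `fixing_stats` map commit hashes to `add`/`del` counts, as the example suggests.

```python
import json

# Abbreviated copy of the example record, inlined so the sketch is self-contained.
RECORD_JSON = """
{
  "cve": "CVE-2019-11274",
  "cwe": "CWE-79",
  "repository": "https://github.com/cloudfoundry/uaa",
  "fixing": ["a34f55fc97a81966faf21e3ae404ec24f1f31cf7"],
  "introducing": "bb8ff8f4e8969b46fdacffcd27781d223c8c7244",
  "intro_stats": {"bb8ff8f4e8969b46fdacffcd27781d223c8c7244": {"add": 320, "del": 7}},
  "fixing_stats": {"a34f55fc97a81966faf21e3ae404ec24f1f31cf7": {"add": 68, "del": 17}},
  "days_between": 1836
}
"""


def total_churn(stats):
    """Sum added and deleted lines over all commits in a *_stats mapping."""
    return sum(entry["add"] + entry["del"] for entry in stats.values())


def summarize(record):
    """Hypothetical helper: condense one record into a flat summary dict."""
    return {
        "cve": record["cve"],
        "cwe": record["cwe"],
        "fixing_commits": len(record["fixing"]),
        "intro_churn": total_churn(record["intro_stats"]),
        "fixing_churn": total_churn(record["fixing_stats"]),
        "days_between": record["days_between"],
    }


summary = summarize(json.loads(RECORD_JSON))
print(summary)
```

Summing `add` and `del` per commit keeps the sketch usable for entries with several fixing commits, since `fixing` is a list.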

## Review Pipeline Instructions

### Prerequisites
| Software | Used Version |
| --- | --- |
| Python3 | 3.10.8 |
| pip3 | 22.3.1 |
| git | 2.29.0 |
| Web browser of choice | Safari 16.1 |

### Setup
To install all required Python packages, run the following command inside the `tool/src` directory:
- `python3 -m pip install -r requirements.txt`

### Usage
The pipeline can be executed by running the following command inside the `tool/src` directory:
- `python3 manual_analysis_pipeline.py <path_to_input_dataset>`

### Input Dataset
The input dataset is expected to be a JSON file with the following fields:

| Fieldname | Brief |
| --- | --- |
| cve_id | CVE ID of the vulnerability |
| repository | URL of the repository |
| fixing_commits | List of fixing commit SHA-1 hashes |
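
Before running the pipeline, it may help to check that an input file has the expected shape. The sketch below is one possible check and not part of the tool: `validate_entry` is a hypothetical helper, the top-level structure is assumed to be a JSON array of entries, requiring a non-empty `fixing_commits` list is an assumption, and the second sample entry is a made-up placeholder used only to show a failing case.

```python
import json
import re

# A full SHA-1 commit hash is 40 hexadecimal characters.
SHA1_PATTERN = re.compile(r"[0-9a-fA-F]{40}")
REQUIRED_FIELDS = ("cve_id", "repository", "fixing_commits")


def validate_entry(entry):
    """Hypothetical helper: return a list of problems found in one input entry."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if problems:
        return problems
    commits = entry["fixing_commits"]
    # Assumption: every entry needs at least one fixing commit.
    if not isinstance(commits, list) or not commits:
        problems.append("fixing_commits must be a non-empty list")
    else:
        for commit in commits:
            if not isinstance(commit, str) or not SHA1_PATTERN.fullmatch(commit):
                problems.append(f"not a full SHA-1 hash: {commit!r}")
    return problems


# Assumption: the input file holds a JSON array of entries; the sample is inlined here.
# The second entry is a made-up placeholder that demonstrates a failing check.
SAMPLE_INPUT = json.loads("""
[
  {
    "cve_id": "CVE-2019-11274",
    "repository": "https://github.com/cloudfoundry/uaa",
    "fixing_commits": ["a34f55fc97a81966faf21e3ae404ec24f1f31cf7"]
  },
  {
    "cve_id": "CVE-0000-0000",
    "repository": "https://example.org/placeholder",
    "fixing_commits": ["abc123"]
  }
]
""")

results = [validate_entry(entry) for entry in SAMPLE_INPUT]
for entry, problems in zip(SAMPLE_INPUT, results):
    print(entry["cve_id"], "ok" if not problems else problems)
```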

citation.cff

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
cff-version: 1.2.0
message: "If you use this dataset or tool, please cite it as below."
type: dataset
authors:
  - family-names: "Hinrichs"
    given-names: "Torge"
    orcid: "https://orcid.org/0000-0001-7489-3540"
  - family-names: "Scandariato"
    given-names: "Riccardo"
    orcid: "https://orcid.org/0000-0003-3591-7671"
title: "Tool Assisted Manual Dataset on Vulnerability Introducing Commits"
version: 0.1.0
doi: 10.5281/zenodo.7565542
date-released: 2023-01-25
url: "https://github.com/tuhh-softsec/Tool-Assisted-Manual-Dataset-on-Vulnerability-Introducing-Commits"
