VulZoo is a large-scale vulnerability intelligence dataset that integrates structured and unstructured data from a variety of sources. It is designed to give security researchers, penetration testers, and security analysts a comprehensive view of vulnerabilities and their associated data.
This dataset is divided into two parts: raw data and processed data.
- raw-data/: contains the raw data from different sources.
- processed/: contains the processed data that is extracted or converted from the raw data.
VulZoo aims to provide the most comprehensive profiling of vulnerabilities for downstream tasks, e.g., vulnerability detection, assessment, prioritization, exploitation, and mitigation.
The following figure shows the conceptual overview of VulZoo:
README.md in processed/ provides more details about the processed data.
If the existing data in VulZoo meets your needs, you can clone this repository without the --recurse-submodules option:
git clone https://github.com/NUS-Curiosity/VulZoo
The dataset is in the processed/ directory. If you need up-to-date data, please follow the data management process below.
Pre-requisites:
- Python 3.6+
- Disk space: 25GB+
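The two pre-requisites above can be checked before cloning. The following is a minimal sketch and not part of VulZoo: the function name `check_prerequisites` is hypothetical, and the 25GB figure is approximated as 25 GiB of free space on the target path.

```python
import shutil
import sys

MIN_PYTHON = (3, 6)   # from the pre-requisites list
MIN_FREE_GB = 25      # from the pre-requisites list (treated as GiB here)


def check_prerequisites(path=".", min_python=MIN_PYTHON, min_free_gb=MIN_FREE_GB):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # sys.version_info compares element-wise against a plain tuple.
    if sys.version_info < min_python:
        problems.append(
            "Python {}.{}+ is required, found {}.{}".format(
                min_python[0], min_python[1],
                sys.version_info[0], sys.version_info[1],
            )
        )

    # shutil.disk_usage reports total/used/free bytes for the filesystem
    # that contains `path`.
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    if free_gb < min_free_gb:
        problems.append(
            "{}GB+ of free disk space is required, found {:.1f}GB".format(
                min_free_gb, free_gb
            )
        )

    return problems
```

Calling `check_prerequisites("/path/to/clone/target")` and printing the returned messages gives a quick go/no-go before running the clone command.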
VulZoo is composed of both git-based and non-git-based sources. The git-based sources are from upstream repositories and organized as git submodules in this repository. The non-git-based sources are crawled and maintained in this repository. To get started, clone the repository with the following command:
git clone --recurse-submodules https://github.com/NUS-Curiosity/VulZoo
VulZoo provides some useful scripts to help you manage the data. As some scripts require specific Python packages, it is recommended to install the required packages first:
pip install -r requirements.txt
You can run the sync-raw-data.sh script to incrementally update the local raw data:
./sync-raw-data.sh
Then, you can run the sync-processed.sh script to process the raw data and synchronize the processed data with the latest raw data:
./sync-processed.sh
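If you want to run both steps in one go, the order matters: the raw data must be synced before it is processed. The sketch below is a hypothetical wrapper (the name `sync_all` and the injectable `runner` parameter are assumptions for illustration, not part of VulZoo). It assumes both scripts sit in the repository root and are executable, as the commands above suggest.

```python
import subprocess

# Raw data first, then processed data, mirroring the two commands above.
SYNC_SCRIPTS = ["sync-raw-data.sh", "sync-processed.sh"]


def sync_all(repo_dir=".", runner=subprocess.run):
    """Run the sync scripts in order and return the names of those that ran."""
    executed = []
    for script in SYNC_SCRIPTS:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so processing never starts after a failed
        # raw-data sync.
        runner(["./" + script], cwd=repo_dir, check=True)
        executed.append(script)
    return executed
```

The `runner` parameter exists only so the function can be exercised without actually launching the scripts; in normal use, `sync_all("/path/to/VulZoo")` is enough.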
P.S.
- You can run print-statistics.py to get the statistics of the processed data.
- Updating attackerkb-database requires an API key provided by AttackerKB. Please set it via an environment variable and run sync-attackerkb.py in scripts/raw-data manually.
- The CPE dictionary is too large to be uploaded to GitHub. Please run the sync-cpe.sh scripts in both scripts/raw-data and scripts/processed locally.
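For the AttackerKB note above, it can help to confirm that the API key is actually present in the environment before starting a long sync. This README does not name the variable that sync-attackerkb.py reads, so the helper below takes the name as an argument; `require_env` is a hypothetical helper, and `ATTACKERKB_API_KEY` in the comment is only a placeholder. Check the script for the real name.

```python
import os


def require_env(name, environ=os.environ):
    """Return the non-empty value of an environment variable or raise."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            "Environment variable {} is not set or is empty".format(name)
        )
    return value


# Example (placeholder name, not confirmed by the VulZoo scripts):
# api_key = require_env("ATTACKERKB_API_KEY")
```

Failing early with a clear message is usually easier to debug than an authentication error partway through the sync.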
- CVE (Common Vulnerabilities and Exposures)
- NVD (National Vulnerability Database)
- CWE (Common Weakness Enumeration)
- CAPEC (Common Attack Pattern Enumeration and Classification)
- CISA KEV (Known Exploited Vulnerabilities)
- ZDI Advisory
- GitHub Advisory
- MITRE ATT&CK
- MITRE D3FEND
- AttackerKB
- Exploit-DB
- oss-security mailing list
- full-disclosure mailing list
- bugtraq mailing list
- GitHub
- git.kernel.org
If you use this dataset, please cite the VulZoo paper:
@inproceedings{10.1145/3691620.3695345,
author = {Ruan, Bonan and Liu, Jiahao and Zhao, Weibo and Liang, Zhenkai},
title = {VulZoo: A Comprehensive Vulnerability Intelligence Dataset},
year = {2024},
isbn = {9798400712487},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3691620.3695345},
doi = {10.1145/3691620.3695345},
booktitle = {Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering},
pages = {2334–2337},
numpages = {4},
location = {Sacramento, CA, USA},
series = {ASE '24}
}