PHP Security Vulnerability Dataset

This dataset contains security vulnerability data and computed machine-learning features for multiple versions of three PHP web applications: phpMyAdmin, Moodle, and Drupal. The data was collected for a vulnerability prediction study, but it can also be used for other empirical security vulnerability research unrelated to prediction. All vulnerabilities in this dataset were verified and localized to individual files by hand. Where multiple releases of an application were studied, the origin of each vulnerability and the path of its migration through the code over time are also recorded.

Version history

Enhancements and additions may be made to this dataset periodically. When a released data file is modified, a new version of the dataset will be created, and the older version will remain archived and accessible on this website.

  1. 1.0: Initial release


This website contains two datasets: a raw dataset and a replication dataset. Most users will be interested in the raw dataset, which contains the complete vulnerability and feature data for each application version. The replication dataset is a subset of the raw dataset, extracted and reformatted, that contains the data and scripts required to replicate the results of the study "Predicting Vulnerable Components: Software Metrics vs Text Mining".

Click here to access the replication dataset

Click here to access the raw dataset