PHP security vulnerability dataset

This dataset contains security vulnerability data and computed machine learning features for multiple versions of three PHP web applications: PHPMyAdmin, Moodle, and Drupal. The data was collected for a vulnerability prediction study, but it can also support other empirical research on security vulnerabilities that is unrelated to prediction.

All vulnerabilities in this dataset were verified by the dataset's authors and localized to individual files by hand. Where multiple releases of an application were studied, the origin of each vulnerability and the path of its migration through the code over time are also recorded.

Application versions present in dataset

Unless an exception is noted below, all data is available for the following releases of each application:

Source code metrics

These R data files contain a selection of computed, file-level source code metrics for each version of each application.

A small number of NA values may appear in the matrices for certain releases for one of the following reasons:

(674K) File metrics for PHPMyAdmin

(1.6MB) File metrics for Moodle

(6.1K) File metrics for Drupal

Vulnerability data

These R data files contain data frames describing each vulnerability analyzed for each application. The following columns are present in these R data files:

(4.7K) Vulnerability data for PHPMyAdmin

(3.5K) Vulnerability data for Moodle

(1.3K) Vulnerability data for Drupal

Vulnerability tracking data

Because multiple major releases of PHPMyAdmin and Moodle were studied, some vulnerabilities migrated from file to file over the studied releases. These R-formatted matrices specify the file affected by each vulnerability at the time of each release. An empty string indicates that the vulnerability had not yet been introduced, or that it was already fixed.
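As a rough illustration of how such a tracking matrix can be read, the sketch below summarizes each vulnerability's lifetime and its moves between files. It assumes the matrix has already been exported from R into a plain Python structure; the vulnerability IDs, release labels, and file paths are invented for the example and do not come from the dataset.

```python
# Hypothetical tracking data: one row per vulnerability, one entry per release.
# An empty string means "not yet introduced" or "already fixed" at that release.
releases = ["r1", "r2", "r3", "r4"]
tracking = {
    "vuln-A": ["", "lib/a.php", "lib/a.php", ""],
    "vuln-B": ["index.php", "index.php", "inc/core.php", "inc/core.php"],
}


def summarize(row, releases):
    """Summarize the affected releases and file-to-file moves for one row."""
    # Keep only the releases in which the vulnerability is present.
    affected = [(rel, path) for rel, path in zip(releases, row) if path != ""]
    # A move is a change of file between two consecutive affected releases.
    moves = [
        (prev_rel, rel, prev_path, path)
        for (prev_rel, prev_path), (rel, path) in zip(affected, affected[1:])
        if prev_path != path
    ]
    return {
        "first": affected[0][0] if affected else None,
        "last": affected[-1][0] if affected else None,
        "files": sorted({path for _, path in affected}),
        "moves": moves,
    }


summary = {vid: summarize(row, releases) for vid, row in tracking.items()}
```

Because an empty string covers both "not yet introduced" and "already fixed", the sketch infers the lifetime from the first and last non-empty entries rather than from the empty ones.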

(4.2K) Vulnerability tracking data for PHPMyAdmin

(2.8K) Vulnerability tracking data for Moodle

Release and branch information

The releases of PHPMyAdmin and Moodle were taken from each application's main branch. Maintenance releases, which would cause the version numbers to be non-increasing over time, were excluded. These R data files contain a data frame of metadata for each of these releases, with the following columns:
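One way to picture this selection is the sketch below, which walks releases in date order and keeps only those whose version number exceeds every earlier one. This is an assumed rule for illustration, not necessarily the procedure the dataset's authors used; the function names, dates, and version numbers are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.1.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def main_branch(releases):
    """Keep releases whose version is strictly greater than all earlier ones.

    `releases` is a list of (ISO date string, version string) pairs.
    """
    kept = []
    highest = None
    # ISO dates sort chronologically when compared as strings.
    for date, version in sorted(releases, key=lambda r: r[0]):
        key = parse_version(version)
        if highest is None or key > highest:
            kept.append((date, version))
            highest = key
    return kept


# Invented sample: 2.0.1 is a maintenance release published after 2.1.0.
sample = [
    ("2010-01-10", "2.0.0"),
    ("2010-06-01", "2.1.0"),
    ("2010-07-15", "2.0.1"),
    ("2011-02-01", "2.2.0"),
]
selected = main_branch(sample)
```

Note that this rule only drops releases that break the increasing order. A maintenance release on the newest branch would still be kept, so a stricter filter may be needed to match the dataset's release list exactly.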

(3.5K) Main branch release metadata for PHPMyAdmin

(2.7K) Main branch release metadata for Moodle

Text mining token data

Pre-extracted token feature data in ARFF format, prepared using the tokenize.php script in the replication dataset, are available here.