PHP Security vulnerability dataset
This dataset contains security vulnerability data and computed machine learning features for multiple versions of three PHP web applications: PHPMyAdmin, Moodle, and Drupal. This data was collected for a vulnerability prediction study; however, it can also be used for other empirical security vulnerability research not related to prediction.
All vulnerabilities in this dataset were verified by the dataset's authors and localized to individual files by hand. In cases where multiple releases of an application were studied, the origin of each vulnerability and the path of its migration through the code over time are also recorded.
Application versions present in dataset
Unless an exception is noted below, data is available for the following releases of each application:
- PHPMyAdmin: 95 releases between 2.2.0 and 4.0.9
- Moodle: 71 releases between 1.0.0 and 2.6.1
- Drupal: Release 6.0.0 only
Source code metrics
These R data files contain a selection of computed, file-level source code metrics for each version of each application.
A small number of NA values may appear in the matrices for certain releases for one of the following reasons:
- A metric could not be computed because the file used new language constructs introduced in the latest versions of PHP, which our PHP metric computation tool does not support. This occurred frequently in Moodle releases 2.2.0 and later; for this reason, file metrics are not present for those releases. We anticipate releasing a future version of this dataset once this issue can be addressed.
- An abstract syntax tree could not be constructed for the file. This typically occurs in files consisting entirely of translation strings that cannot be parsed in the default locale.
(674K) File metrics for PHPMyAdmin
(1.6MB) File metrics for Moodle
(6.1K) File metrics for Drupal
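Because NA values can appear in the metric matrices, a consumer of this data will usually want to separate files with complete metrics from files with missing ones before analysis. The following is a minimal Python sketch of that step, assuming the matrices have already been converted from R into plain Python dictionaries with None standing in for NA. The file names, metric names, and values are hypothetical and are not taken from the dataset.

```python
# Stand-in for R's NA after conversion to Python (an assumption of this sketch).
NA = None


def split_complete_rows(metrics):
    """Split a {filename: {metric: value}} mapping into two mappings:
    files whose metrics are all present, and files with at least one NA."""
    complete, incomplete = {}, {}
    for filename, row in metrics.items():
        has_na = any(value is NA for value in row.values())
        (incomplete if has_na else complete)[filename] = row
    return complete, incomplete


# Hypothetical toy data; real metric names and values will differ.
metrics = {
    "index.php": {"loc": 120, "functions": 4},
    "lang/strings.php": {"loc": NA, "functions": NA},
    "lib/util.php": {"loc": 300, "functions": NA},
}

complete, incomplete = split_complete_rows(metrics)
print(sorted(complete))
print(sorted(incomplete))
```

Whether incomplete rows should be dropped or imputed depends on the analysis; this sketch only separates them.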
Vulnerability data
These R data files contain data frames describing each vulnerability analyzed for each application. The data frames have the following columns:
- cveid: The CVE identifier for the vulnerability.
- drupalid: (Drupal only) The Drupal vulnerability tracking ID for the vulnerability.
- introhash: (PHPMyAdmin and Moodle) The Git hash of the commit where the vulnerability was introduced into the source code. This column is used as an identifier in the vulnerability tracking matrix; when multiple vulnerabilities were introduced in a single commit, these identifiers are made unique with a trailing underscore and unique key.
- fixhash: (PHPMyAdmin and Moodle) The Git hash of the commit where the vulnerability was fixed.
- lastvulversion: (Drupal only) The version number of the last release of the software affected by the vulnerability. (This information can be read from the vulnerability tracking matrix for PHPMyAdmin and Moodle.)
- fixedversion: (Drupal only) The version number of the first release of the software where the vulnerability had been fixed. (This information can be read from the vulnerability tracking matrix for PHPMyAdmin and Moodle.)
- filename: (Drupal only) The name of the PHP file affected by the vulnerability. (Because only version 6 of Drupal was considered in this dataset, no vulnerabilities migrated between files during the study period.)
- fixfile: (PHPMyAdmin and Moodle) The name of the PHP file affected by the vulnerability at the time that the vulnerability was fixed.
(4.7K) Vulnerability data for PHPMyAdmin
(3.5K) Vulnerability data for Moodle
(1.3K) Vulnerability data for Drupal
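Since introhash values double as identifiers and may carry a trailing underscore and unique key, recovering the underlying commit hash requires stripping that suffix. The sketch below shows one way to do this in Python. It assumes the suffix has the form "_<key>" and relies on the fact that Git hashes are hexadecimal and therefore never contain an underscore. The hash values shown are hypothetical.

```python
def commit_of(introhash):
    """Return the Git commit hash portion of an introhash identifier.

    Git hashes contain only hexadecimal characters, so everything before
    the first underscore (if any) is taken to be the commit hash.
    """
    return introhash.split("_", 1)[0]


def group_by_commit(introhashes):
    """Group identifiers by the commit that introduced them."""
    groups = {}
    for ident in introhashes:
        groups.setdefault(commit_of(ident), []).append(ident)
    return groups


# Hypothetical identifiers: two vulnerabilities introduced in one commit,
# and one vulnerability introduced alone in another.
ids = ["a1b2c3d_1", "a1b2c3d_2", "9f8e7d6"]
groups = group_by_commit(ids)
print(groups)
```

Grouping this way makes it easy to count how many vulnerabilities a single commit introduced.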
Vulnerability tracking data
Because multiple major releases of PHPMyAdmin and Moodle were studied, some vulnerabilities migrated from file to file over the studied releases. These R-formatted matrices specify the file affected by each vulnerability at the time of each release. An empty string indicates that the vulnerability had not yet been introduced, or that it was already fixed.
(4.2K) Vulnerability tracking data for PHPMyAdmin
(2.8K) Vulnerability tracking data for Moodle
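To illustrate how the tracking matrices can be read, the following Python sketch derives the first affected, last affected, and first fixed release for a vulnerability, and lists the files affected at a given release. It assumes each row corresponds to one vulnerability and each column to one release in chronological order; the actual orientation of the R matrices should be checked after loading. All identifiers, versions, and file names are hypothetical.

```python
# Hypothetical release list, in chronological order.
versions = ["1.0", "1.1", "1.2", "1.3"]

# Hypothetical tracking data: one row per vulnerability, one entry per release.
# An empty string means "not yet introduced" or "already fixed".
tracking = {
    "a1b2c3d": ["", "lib/a.php", "lib/b.php", ""],
    "9f8e7d6": ["index.php", "index.php", "index.php", "index.php"],
}


def vulnerable_span(row, versions):
    """Return (first affected, last affected, first fixed) versions.

    An entry is None when it cannot be determined from the studied releases.
    """
    affected = [i for i, filename in enumerate(row) if filename != ""]
    if not affected:
        return None, None, None
    first, last = affected[0], affected[-1]
    fixed = versions[last + 1] if last + 1 < len(versions) else None
    return versions[first], versions[last], fixed


def vulnerable_files(tracking, versions, version):
    """Return the set of files affected by any vulnerability at a release."""
    col = versions.index(version)
    return {row[col] for row in tracking.values() if row[col] != ""}


print(vulnerable_span(tracking["a1b2c3d"], versions))
print(vulnerable_files(tracking, versions, "1.2"))
```

In this toy example the first vulnerability moves from lib/a.php to lib/b.php between two releases, which is the kind of migration the matrices record.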
Release and branch information
The releases of PHPMyAdmin and Moodle were extracted from the applications' main branches, omitting maintenance releases that would otherwise cause the version numbers to be non-increasing over time. These R data files contain data frames of metadata for each of these releases, with the following columns:
- versions: The version number of the release in question
- dates: The release date of the release in question
- githashes: The Git hash of the last commit finalizing the release in question
(3.5K) Main branch release metadata for PHPMyAdmin
(2.7K) Main branch release metadata for Moodle
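The effect of omitting maintenance releases can be sketched as follows. This is only one possible reading of the filtering described above, not the procedure the dataset's authors used: releases are ordered by date, and a release is kept only if its version number is higher than every version kept before it. The sketch assumes purely numeric dotted version strings and ISO-formatted dates, and the release list is hypothetical.

```python
def parse_version(version):
    """Convert a dotted numeric version such as "2.1.0" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def main_branch(releases):
    """Keep releases whose version strictly increases over time.

    releases is a list of (version, date) pairs with ISO "YYYY-MM-DD" dates,
    which sort chronologically when compared as strings.
    """
    kept = []
    highest = None
    for version, date in sorted(releases, key=lambda release: release[1]):
        key = parse_version(version)
        if highest is None or key > highest:
            kept.append((version, date))
            highest = key
    return kept


# Hypothetical releases: "2.0.1" is a maintenance release published after
# "2.1.0", so keeping it would make the version sequence decrease.
releases = [
    ("2.0.0", "2010-01-01"),
    ("2.1.0", "2010-06-01"),
    ("2.0.1", "2010-07-01"),
    ("2.2.0", "2011-01-01"),
]

kept = main_branch(releases)
print([version for version, _ in kept])
```

The resulting list has the same shape as the versions and dates columns described above, with versions increasing alongside dates.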
Text mining token data
Pre-extracted token feature data in ARFF format, prepared using the above tokenize.php script in the replication dataset, are available here.