- `analyzed_projects_all.csv` contains, in CSV format, the list of all cloned projects at the time of this study. `repo_name` is the repository name; `last_checkout` is the hash of the last commit available at the time of the clone; and `date` is the date of the latest available commit.
- `detailed-database` is a folder containing the two complete datasets we defined. `overall.json` contains all the instances of our dataset (1,930); `language-filtered.json` contains 1,115 instances involving files in the following languages: C, Python, C++, JavaScript, Java, PHP, Ruby, and C#. Both datasets are JSON arrays. Each element has the following structure:
  - `id` is a unique ID assigned during the construction phase; its value is distinct for every entry;
  - `repository` is the repository name as hosted on GitHub (owner/project-name);
  - `fix` contains information about the fix, including:
    - `commit`: metadata about the commit, including:
      - `hash`: commit hash;
      - `message`: commit message;
      - `author`: commit author;
      - `url`: GitHub API URL with complete information about the commit;
    - `files`: an array of files modified in the fix commit; each element provides:
      - `name`: name of the modified file after the commit (this is not the complete path, just the file name);
      - `old_path`/`new_path`: the path of the file before and after the commit;
      - `lang`: extension of the file (indicating the programming language);
      - `lines_added`/`lines_deleted`: lists of line numbers added/deleted;
      - `change_type`: type of change (one of the following: "MODIFY"/"ADD"/"RENAME"/"DELETE");
  - `bugs` contains the list of bug-inducing commits for the fix; each element includes:
    - `commit`: metadata about the commit, including:
      - `hash`: commit hash;
      - `message`: commit message;
      - `author`: commit author;
      - `url`: GitHub API URL with complete information about the commit;
    - `files`: an array of files modified in the bug-inducing commit; each element provides:
      - `name`: name of the modified file after the commit (this is not the complete path, just the file name);
      - `old_path`/`new_path`: the path of the file before and after the commit;
      - `lang`: extension of the file (indicating the programming language);
      - `lines_added`/`lines_deleted`: lists of line numbers added/deleted;
      - `change_type`: type of change (one of the following: "MODIFY"/"ADD"/"RENAME"/"DELETE");
  - `issue_urls` is a list of URLs of issues referenced in the fix commit;
  - `earliest_issue_date` is the date of the earliest issue referenced in the fix commit (YYYY-MM-DDTHH:MM:SS);
  - `best_scenario_issue_date` represents the date of an ideal issue reported for the bug; it is the date of the last bug-inducing commit incremented by 60 seconds (YYYY-MM-DDTHH:MM:SS).
- `json-input-raw` is a folder containing four datasets used as input for our experiments, derived from `language-filtered.json`. `bugfix_commits_all.json` and `bugfix_commits_issues_only.json` contain 1,115 and 129 instances in JSON format, respectively. `bugfix_commits_all_java.json` and `bugfix_commits_issues_only_java.json` contain 80 and 10 instances in JSON format, respectively.
  These datasets provide the input list of the selected fix commits and their respective lists of bug-inducing commits, along with the following additional information used in our SZZ evaluation:
  - `id` is a unique ID assigned during the construction phase; its value is distinct for every entry;
  - `repo_name` is the repository name as hosted on GitHub;
  - `fix_commit_hash` is the commit hash of the selected fix;
  - `bug_commit_hash` is a list of bug-inducing commits;
  - `earliest_issue_date` is a string containing the timestamp of the earliest issue (YYYY-MM-DDTHH:MM:SS);
  - `best_scenario_issue_date` represents the date of an ideal issue reported for the bug; it is the date of the last bug-inducing commit incremented by 60 seconds (YYYY-MM-DDTHH:MM:SS);
  - `issue_urls` is a list of URLs of issues referenced in the fix commit;
  - `language` is a list of the programming languages of the files impacted by the fix commit.
- `cloned` is a placeholder folder where git repositories must be copied (or cloned) to replicate this work. See the instructions below.
- `json-output-raw` is a folder containing the JSON files with our pre-calculated results for each SZZ algorithm.
- `scripts` is a folder that contains all scripts created to post-process or analyze our data.
- `tools` is a folder that contains a snapshot of the developed code. For new studies, please use the extended version, PySZZ v2.
- `results` is a folder that contains all calculated metrics, such as Precision, Recall, F-measure, etc.
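
As an illustration of the `detailed-database` entry structure described above, the following minimal sketch reads one entry and also shows how a `best_scenario_issue_date` value relates to the date of the last bug-inducing commit. The helper names (`load_dataset`, `summarize_entry`, `best_scenario_issue_date`) are hypothetical and not part of the replication package, and the key nesting is assumed to match the field list above.

```python
import json
from datetime import datetime, timedelta

DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"  # YYYY-MM-DDTHH:MM:SS, as documented above


def load_dataset(path):
    """Load one of the JSON arrays, e.g. detailed-database/overall.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_entry(entry):
    """Collect the fix hash, the BIC hashes, and the fix-file extensions.

    Assumes the nesting fix -> commit -> hash and bugs[] -> commit -> hash.
    """
    fix = entry["fix"]
    return {
        "id": entry["id"],
        "repository": entry["repository"],
        "fix_hash": fix["commit"]["hash"],
        "bug_hashes": [bug["commit"]["hash"] for bug in entry["bugs"]],
        "languages": sorted({changed["lang"] for changed in fix["files"]}),
    }


def best_scenario_issue_date(last_bic_date):
    """Return the date of the last bug-inducing commit plus 60 seconds."""
    parsed = datetime.strptime(last_bic_date, DATE_FORMAT)
    return (parsed + timedelta(seconds=60)).strftime(DATE_FORMAT)
```

A typical use would be `[summarize_entry(e) for e in load_dataset("detailed-database/overall.json")]`, adjusted if the actual files deviate from the assumed nesting.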
The following instructions explain how to execute our suite of tools and generate our results. This example refers to the B-SZZ variant, but any other algorithm can be reproduced by changing the input arguments as detailed in the original guide. See `tools/pyszz.zip` for more instructions.
- Preparing input data. As the first step, you need to clone the git repository of every project. You can rely on the following approach.
  - As an alternative, you can clone each repository into the `cloned` folder and then check out the commit hashes listed in `analyzed_projects_all.csv` and `analyzed_projects_issues_only.csv`. This recreates the exact conditions of our experiment.
- Running SZZ. PySZZ (see `tools/pyszz.zip` for a replication snapshot, and check the reported URL for the latest version) is a free, open-source suite of tools that implements all major SZZ variants in Python. You can run a specific variant by passing a pre-defined `yml` file, or experiment with custom inputs. E.g., `conf/bszz.yml` activates the B-SZZ variant.

  `python3 main.py json-input-raw/bugfix_commits_all.json conf/bszz.yml cloned` runs the B-SZZ algorithm.

  Where:
  - `json-input-raw/bugfix_commits_all.json` is the input list of fixes;
  - `conf/bszz.yml` is a pre-defined list of settings used to activate a specific variant (see `tools/pyszz.zip` for more details);
  - `cloned` is the folder containing the pre-cloned repositories.
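
The input-preparation step above (cloning each repository and checking out the recorded commit) could be scripted roughly as follows. This is only a sketch under several assumptions: the CSV is assumed to have a header row with `repo_name` and `last_checkout` columns, `repo_name` is assumed to be in `owner/project-name` form, and the `cloned/<owner>/<project-name>` layout is a guess, so check the instructions in `tools/pyszz.zip` for the layout PySZZ expects. The function names are hypothetical.

```python
import csv
import io
import subprocess


def checkout_commands(csv_text, cloned_dir="cloned"):
    """Build the git commands that clone each project and pin its commit."""
    commands = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        repo = row["repo_name"]                # assumed: owner/project-name
        target = f"{cloned_dir}/{repo}"        # assumed directory layout
        url = f"https://github.com/{repo}.git"
        commands.append(["git", "clone", url, target])
        commands.append(["git", "-C", target, "checkout", row["last_checkout"]])
    return commands


def run_all(commands):
    """Execute the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the command list separately from running it makes it possible to inspect (or dry-run) what will be executed, e.g. `run_all(checkout_commands(open("analyzed_projects_all.csv").read()))`.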
NOTE. SZZUnleashed and OpenSZZ are not part of the PySZZ suite. We adapted the original implementations to our input formats.
- The SZZUnleashed implementation has been forked to handle our input format and add parallel support: SZZUnleashed-adapted (see `tools/szz-unleashed.zip` for a snapshot of our adapter).
- The OpenSZZ implementation has been forked to exclude the Jira filter: OpenSZZ (see `tools/open-szz.zip` for a snapshot of our adapter). OpenSZZ needs post-processing to adapt the generated results to our JSON format; see the OpenSZZ post-processing script below.

Both snapshots, `tools/szz-unleashed.zip` and `tools/open-szz.zip`, contain the instructions to use our adapters.
`json-output-raw` contains the JSON files generated by each SZZ variant.
Specifically, `bic_<algorithm-name>_bugfix_commits_all.json` and `bic_<algorithm-name>_bugfix_commits_issues_only.json` are the output of the `<algorithm-name>` SZZ variant.
In contrast, `bic_<algorithm-name>_bugfix_commits_all-filter.json` and `bic_<algorithm-name>_bugfix_commits_issues_only-filter.json` are the post-filtered outputs obtained when the filter on issue dates is applied.
We use `ruby postfilter.rb <json-output> <cloned>` to post-process `bic_<algorithm-name>_bugfix_commits_all.json` and `bic_<algorithm-name>_bugfix_commits_issues_only.json` and generate `bic_<algorithm-name>_bugfix_commits_all-filter.json` and `bic_<algorithm-name>_bugfix_commits_issues_only-filter.json`, a reduced list of data points filtered by issue date.
- `postfilter.rb` is our Ruby script that parses the output of any SZZ algorithm and filters out the BICs that do not respect the issue date condition.
- `<json-output>` is the input folder containing the JSON files produced by PySZZ;
- `<cloned>` is the path to the pre-cloned (or checked-out) repositories.
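
To illustrate the idea of the issue-date filter, here is a minimal Python sketch. It is not a port of `postfilter.rb`: it assumes the condition is that a bug-inducing commit must have been made strictly before the issue date, and it takes the commit dates as a plain dictionary, whereas the actual script works on the SZZ output files and the cloned repositories. The function name and input shape are hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"  # same timestamp format as the datasets


def filter_by_issue_date(bic_dates, issue_date):
    """Keep only the BIC hashes committed strictly before the issue date.

    bic_dates maps a commit hash to its commit date (hypothetical input shape).
    """
    limit = datetime.strptime(issue_date, DATE_FORMAT)
    return [
        commit_hash
        for commit_hash, commit_date in bic_dates.items()
        if datetime.strptime(commit_date, DATE_FORMAT) < limit
    ]
```

Under this assumption, a candidate commit dated after the issue was reported cannot have introduced the reported bug, so it is dropped from the result.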
`overlap.py` is a Python script with embedded input paths that can be used to calculate Recall, Precision, F-measure, and overlap.
You may need to adapt the `base_path` global variable to point to your results directory. E.g., `base_path = "json-output-raw/"` analyzes the study's results.
This tool produces:
- `<dataset>-recall-precision.csv` lists Precision, Recall, F-measure, the total number of correct instances (our oracle), and the total number of identified instances;
- `<dataset>-overlap_vi_vj.csv` lists the overlap, the total number of BICs uniquely identified, the total number of BICs correctly identified, and the union of all BICs correctly identified by all models;
- `<dataset>-overlap_vi_but_others.csv` is a CSV version of the heatmap for the overlap comparison;
- `<dataset>-not-identified.csv` summarizes the BICs that were not found;
- `<dataset>-heatmap.pdf` is the heatmap as reported in the manuscript;
- `wrong` is a subfolder with a list of CSV files containing the wrongly identified BICs, each with a link to the GitHub fix commit.
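
For readers unfamiliar with the metrics listed above, the sketch below shows how Precision, Recall, and F-measure can be computed from a set of identified BICs and the oracle. It is not taken from `overlap.py`: representing each data point as a `(fix_hash, bic_hash)` pair and defining the overlap between two variants as the intersection over the union of their correctly identified BICs are assumptions made for illustration, and the function names are hypothetical.

```python
def precision_recall_f1(identified, oracle):
    """Compute Precision, Recall, and F-measure for one SZZ variant.

    Both arguments are collections of (fix_hash, bic_hash) pairs.
    """
    identified, oracle = set(identified), set(oracle)
    correct = len(identified & oracle)
    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(oracle) if oracle else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def overlap(correct_i, correct_j):
    """Share of correct BICs found by both variants among those found by either."""
    correct_i, correct_j = set(correct_i), set(correct_j)
    union = correct_i | correct_j
    return len(correct_i & correct_j) / len(union) if union else 0.0
```

Working on sets keeps duplicates in a variant's output from inflating the counts; whether the original script deduplicates in the same way is not stated here.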
OpenSZZ produces three files for each analyzed instance. E.g., `AIFDR_inasafe_BugFixingCommit.csv`, `AIFDR_inasafe_BugInducingCommits.csv`, and `AIFDR_inasafe.txt`.
To transform all these CSV files into a single JSON file compatible with `overlap.py`, we created a small script, `openszz_file_refactoring.py`.
`python3 openszz_file_refactoring.py <oracle> <openszz-issue> <bic_open_bugfix_commits_issues_only.json>`
Where:
- `<oracle>` is the list of fixes. E.g., `json-input-raw/bugfix_commits_all.json`;
- `<openszz-issue>` is the path of the folder where OpenSZZ writes its results;
- `<bic_open_bugfix_commits_issues_only.json>` is the destination file in which the OpenSZZ bug-inducing commits are stored in JSON format.