Data Description

Processing MRLoader output

To process the data loaded with MRLoader we provide MRLoaderData class. MRLoaderData processes data from the raw output and filters it from from_date to to_date:

from batcore.data import MRLoaderData

data = MRLoaderData('path/to/the/directory/containing/output/of/MRLoader',
                     from_date=datetime(), # all events before are removed
                     to_date=datetime(), # all events after are removed
                    )

Resulting instance of MRLoaderData has three relevant fields: pulls, comments, and commits.

Dataframes

pulls is a pandas DataFrame with information about pull requests. It has following fields:

  • key_change - identifier of the pull request

  • file - list of files’ paths changed in the pull request

  • reviewer - list of reviewers

  • date - time of the creation of the pull request (datetime.datetime instance)

  • owner - list of owners of the pull requests. We identify users by strings with the following format “{name}:{e-mail}:{login}”.

  • title - title of the pull request

  • status - status of the pull request (one of three: “MERGED”, “OPEN”, and “ABANDONED”)

  • closed - date when pull request was closed

  • author - set with authors of the pull request (i.e. authors of the commits)

commits is a pandas DataFrame with information about commits. It has following fields:

  • key_commit - identifier of the commit

  • key_change - id of the pull request with the commit

  • key_file - file path of the changed file (in case of several changed files, there will be several entries with same key_commit)

  • key_user - author of the commit

  • date - date of the commit

comments is a pandas DataFrame with information about commits. It has following fields:

  • key_change - id of the pull request with the commit

  • key_file - file path of the changed file (nan when the comment was made not to the file)

  • key_user - author of the commit

  • date - date of the commit

Saving and Loading

You can save data from MRLoaderData with:

data.to_checkpoint('path/to/checkpoint')

And load with:

data = MRLoaderData().from_checkpoint('path/to/checkpoint')

Custom data

from_checkpoint method can also be used to load custom data. Provided data should be in a specific format. Path to checkpoint should lead to the folder with pulls.csv and optionally with comments.csv and commits.csv. Those files should have a header with feature names from above and first index column. Not all of the features are necessary for all of the models and some of them can be omitted and replaced with nan values. All of the relevant columns must contain data in the form described above with the following exceptions:

  • dates - date-string with the following format “yyyy-mm-dd hh:mm:ss”

  • user_ids - can be anything, but if you use our implementation of alias matching it should be a string with the format “{name}:{e-mail}:{login}”

Required data for different models

  • ACRec
    • pulls: date, reviewer, key_change, owner

    • comments: key_change, key_user

  • cHRev
    • pulls: date, reviewer, key_change, file, owner

    • comments: date, key_user, key_file

  • RevFinder
    • pulls: date, reviewer, key_change, file, owner

  • RevRec
    • pulls: date, reviewer, key_change, owner, file

    • comments: date, key_user, key_file, key_change

  • Tie
    • pulls: date, reviewer, key_change, title, file, owner

  • WRC
    • pulls: date, reviewer, key_change, file, owner

  • xFinder
    • pulls: date, reviewer, key_change, owner, file

    • commits: date, key_user, key_file

  • CN
    • pulls: date, reviewer, key_change, owner

    • comments: date, key_user, key_change