Data Classes

class batcore.data.StandardDataset(dataset=None, max_file=86, commits=False, comments=False, user_items=False, file_items=False, pull_items=False, remove_empty=False, owner_policy='author_owner_fallback', remove='none', process_users=False, factorize_users=True, alias=False, remove_bots=True, bots='auto', project_name='', self_review_flag=False, checkpoint_path=None, verbose=False, log_file_path=None, log_stdout=False, log_mode='a')

Dataset class used by most of the implemented models.

Parameters:
  • dataset – GerritLoader-like object

  • max_file – maximum number of files that a review can have

  • commits – if False commits are omitted from the data

  • comments – if False comments are omitted from the data

  • user_items – if True user2id map is created

  • file_items – if True file2id map is created

  • pull_items – if True pull2id map is created

  • owner_policy – how pull owners are calculated:
      • None – owners are unchanged
      • author – commit authors of the pull are treated as owners
      • author_no_na – commit authors of the pull are treated as owners; pulls without an author are removed
      • author_owner_fallback – if the pull has an author, the owner field is set to the author; otherwise, nothing is done

  • remove – list of columns to remove from the reviewers. Can be a subset of [‘owner’, ‘author’]

  • factorize_users – when True, users are replaced by their ids

  • alias – True if clustering of the users by name should be performed

  • bots – strategy for bot identification in user factorization. When ‘auto’ bots will be determined automatically. Otherwise, path to the csv with bot accounts should be specified

  • project_name – name of the project for automatic bot detection

  • self_review_flag – when True, adds a column to the pulls dataframe that signals whether there was a self-review (based on the user aliases)
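The four owner_policy options listed above can be illustrated with a small sketch. This is not the library's implementation: the function name resolve_owner, the dict-based pull, and the "owner" and "author" keys are assumptions made only for illustration. How the 'author' policy treats a pull without an author is not stated in the description, so the sketch simply copies the (possibly empty) author value.

```python
from typing import Optional


def resolve_owner(pull: dict, policy: Optional[str]) -> Optional[dict]:
    """Return a copy of the pull with its owner resolved, or None if the pull is dropped.

    Hypothetical helper: `pull` is assumed to be a dict with "owner" and
    "author" keys, which is not necessarily how batcore stores pulls.
    """
    author = pull.get("author")
    result = dict(pull)  # shallow copy so the input pull is not mutated

    if policy is None:
        # Owners are left unchanged.
        return result
    if policy == "author":
        # Commit authors are treated as owners (assumption: even if empty).
        result["owner"] = author
        return result
    if policy == "author_no_na":
        # Pulls without an author are removed.
        if not author:
            return None
        result["owner"] = author
        return result
    if policy == "author_owner_fallback":
        # Use the author when present, otherwise keep the existing owner.
        if author:
            result["owner"] = author
        return result
    raise ValueError(f"unknown owner policy: {policy!r}")
```

With a pull such as {"owner": ["alice"], "author": []}, the fallback policy keeps ["alice"] as the owner, while author_no_na drops the pull.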

additional_preprocessing(events)

creates all item2id maps

get_comments(dataset)
Parameters:

dataset – GerritLoader-like dataset

Returns:

preprocessed comments dataframe

get_commits(dataset)
Parameters:

dataset – GerritLoader-like dataset

Returns:

preprocessed commits dataframe

get_pulls(dataset)
Parameters:

dataset – GerritLoader-like dataset

Returns:

preprocessed pulls dataframe

itemize_files(events)

creates file2id map from events

itemize_pulls(events)

creates pull2id map from events

itemize_users(events)

creates user2id map from events
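The idea behind the item2id maps built by the itemize_* methods can be sketched as follows. The helper name build_item2id, the event layout, and the "reviewer" and "file" keys are assumptions for illustration; the actual event structure in batcore may differ.

```python
from typing import Dict, Hashable, Iterable


def build_item2id(items: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Assign consecutive integer ids to items in order of first appearance."""
    item2id: Dict[Hashable, int] = {}
    for item in items:
        if item not in item2id:
            item2id[item] = len(item2id)
    return item2id


# Hypothetical events: each one lists the reviewers and files it involves.
events = [
    {"reviewer": ["alice", "bob"], "file": ["a.py"]},
    {"reviewer": ["bob", "carol"], "file": ["a.py", "b.py"]},
]

user2id = build_item2id(u for e in events for u in e["reviewer"])
file2id = build_item2id(f for e in events for f in e["file"])
```

Ids assigned by first appearance stay stable as long as the events are processed in the same order.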

preprocess(dataset)
Parameters:

dataset – GerritLoader-like dataset

Returns:

all necessary events, preprocessed and returned as a data stream

replace(data, rev)

Simulates history by modifying the reviewer list of a single pull.

Parameters:
  • data – a single pull that needs to be modified

  • rev – reviewer to be added to the pull

Returns:

pull with a modified reviewer list
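A minimal sketch of the history-simulation step that replace describes is given here. The description does not say whether the reviewer is appended or substituted for the existing ones, so this sketch appends; the function name with_added_reviewer and the "reviewer" key are assumptions, not the library's API.

```python
import copy


def with_added_reviewer(pull: dict, reviewer) -> dict:
    """Return a copy of the pull whose reviewer list includes `reviewer`.

    Hypothetical helper: the pull is assumed to be a dict with a
    "reviewer" list. The original pull is left untouched.
    """
    modified = copy.deepcopy(pull)
    reviewers = modified.setdefault("reviewer", [])
    if reviewer not in reviewers:
        reviewers.append(reviewer)
    return modified
```

Working on a deep copy keeps the recorded history intact, so the simulated pull can be discarded without side effects.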

to_checkpoint(path)

saves dataset

Parameters:

path – path to the folder to save results

class batcore.data.MRLoaderData(path=None, from_date=None, to_date=None, verbose=False, log_file_path=None, log_stdout=False, log_mode='a')

Helper dataset class for Gerrit-based projects. It reads data in the format produced by our download script and outputs it in a convenient format.

Parameters:
  • path – path to the folder with the data from the loader tool or to the saved dataset

  • from_date (datetime.datetime) – all events before from_date are removed from the data

  • to_date (datetime.datetime) – all events after to_date are removed from the data
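The effect of from_date and to_date can be sketched as a simple window filter. The function name filter_by_date and the "date" key are assumptions for illustration. Events exactly on a boundary are kept here, since the description only removes events before from_date or after to_date.

```python
from datetime import datetime
from typing import List, Optional


def filter_by_date(
    events: List[dict],
    from_date: Optional[datetime] = None,
    to_date: Optional[datetime] = None,
) -> List[dict]:
    """Keep events whose "date" lies within [from_date, to_date]."""
    kept = []
    for event in events:
        date = event["date"]
        if from_date is not None and date < from_date:
            continue  # event happened before the window starts
        if to_date is not None and date > to_date:
            continue  # event happened after the window ends
        kept.append(event)
    return kept


# Hypothetical events carrying only a key and a timestamp.
events = [
    {"key": 1, "date": datetime(2020, 1, 1)},
    {"key": 2, "date": datetime(2021, 6, 15)},
    {"key": 3, "date": datetime(2022, 12, 31)},
]
window = filter_by_date(events, datetime(2021, 1, 1), datetime(2022, 1, 1))
```

Leaving either bound as None disables that side of the filter, matching the None defaults in the class signature.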

static get_df(path)

reads all the csv files from MR-loader

Parameters:

path – path to the directory with csv files

Returns:

dictionary with all dataframes for pulls, commits, and comments
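The pattern of reading every csv file in a directory into a dictionary can be sketched with the standard library alone. This is an illustration, not get_df itself: the function name read_csv_dir and the file names are assumptions, and the tables are lists of row dicts rather than dataframes.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_dir(path) -> dict:
    """Read every *.csv file in `path` into a dict keyed by file stem."""
    tables = {}
    for csv_file in sorted(Path(path).glob("*.csv")):
        with csv_file.open(newline="", encoding="utf-8") as handle:
            tables[csv_file.stem] = list(csv.DictReader(handle))
    return tables


# Demonstration on a temporary directory with two hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pulls.csv").write_text("key,owner\n1,alice\n", encoding="utf-8")
    (Path(tmp) / "comments.csv").write_text("key,author\n1,bob\n", encoding="utf-8")
    tables = read_csv_dir(tmp)
```

Keying the result by file stem lets later steps look up each table by name, for example tables["pulls"].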

prepare(data)

processes the data

Parameters:

data – dictionary with all dataframes

Returns:

pulls and commits dataframes with all mined features

prepare_pulls()

pull preprocessing

to_checkpoint(path)

saves dataset

Parameters:

path – path to the folder to save results