Data Classes
- class batcore.data.StandardDataset(dataset=None, max_file=86, commits=False, comments=False, user_items=False, file_items=False, pull_items=False, remove_empty=False, owner_policy='author_owner_fallback', remove='none', process_users=False, factorize_users=True, alias=False, remove_bots=True, bots='auto', project_name='', self_review_flag=False, checkpoint_path=None, verbose=False, log_file_path=None, log_stdout=False, log_mode='a')
Dataset class used by most of the implemented models.
- Parameters:
dataset – GerritLoader-like object
max_file – maximum number of files that a review can have
commits – if False commits are omitted from the data
comments – if False comments are omitted from the data
user_items – if True user2id map is created
file_items – if True file2id map is created
pull_items – if True pull2id map is created
owner_policy – how pull owners are calculated:
* None - owners are unchanged
* author - commit authors of the pull are treated as owners
* author_no_na - commit authors of the pull are treated as owners; pulls without an author are removed
* author_owner_fallback - if the pull has an author, the owner field is set to the author; otherwise, nothing is done
remove – list of columns whose users are removed from the reviewers. Can be a subset of ['owner', 'author']
factorize_users – if True users are replaced by ids
alias – if True users are clustered by name
bots – strategy for bot identification in user factorization. When 'auto', bots are determined automatically. Otherwise, the path to a csv file with bot accounts should be specified
project_name – name of the project for automatic bot detection
self_review_flag – if True adds a column to the pulls dataframe that indicates whether there was a self-review (based on the users' aliases)
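The owner_policy options can be illustrated with a small standalone sketch. It does not use batcore itself: the function name apply_owner_policy and the dict layout of a pull (keys 'owner' and 'author' holding lists of users) are assumptions made for illustration. The behaviour of the 'author' policy for pulls without an author (owner becomes empty) is also an assumption, since the description above does not cover that case.

```python
def apply_owner_policy(pulls, policy):
    """Return copies of the pulls with the 'owner' field adjusted.

    Hypothetical helper: pulls are plain dicts with 'owner' and 'author'
    lists, which is an assumed layout, not batcore's internal one.
    """
    valid = {None, "author", "author_no_na", "author_owner_fallback"}
    if policy not in valid:
        raise ValueError(f"unknown owner policy: {policy!r}")

    result = []
    for pull in pulls:
        pull = dict(pull)  # shallow copy so the input is not modified
        authors = list(pull.get("author") or [])
        if policy is None:
            pass  # owners are unchanged
        elif authors:
            pull["owner"] = authors  # commit authors become the owners
        elif policy == "author_no_na":
            continue  # pulls without an author are removed
        elif policy == "author":
            pull["owner"] = []  # assumption: no author means no owner
        # author_owner_fallback without an author: owner is kept as is
        result.append(pull)
    return result
```

With the default 'author_owner_fallback', a pull that has no commit author keeps its original owner, so no pull is dropped and no owner field is emptied.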
- additional_preprocessing(events)
creates all item2id maps
- get_comments(dataset)
- Parameters:
dataset – GerritLoader-like dataset
- Returns:
preprocessed comments dataframe
- get_commits(dataset)
- Parameters:
dataset – GerritLoader-like dataset
- Returns:
preprocessed commits dataframe
- get_pulls(dataset)
- Parameters:
dataset – GerritLoader-like dataset
- Returns:
preprocessed pulls dataframe
- itemize_files(events)
creates file2id map from events
- itemize_pulls(events)
creates pull2id map from events
- itemize_users(events)
creates user2id map from events
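A minimal sketch of what an item2id map of this kind could look like, assuming events are dicts whose fields hold lists of items. The helper build_item2id and the field names 'file' and 'reviewer' are hypothetical and are not part of the batcore API; ids are assigned in order of first appearance.

```python
def build_item2id(events, key):
    """Map every distinct item found under events[i][key] to an integer id.

    Hypothetical helper: the event layout is an assumption for illustration.
    """
    item2id = {}
    for event in events:
        for item in event.get(key, []):
            if item not in item2id:
                # the next free id equals the current size of the map
                item2id[item] = len(item2id)
    return item2id


events = [
    {"file": ["a.py", "b.py"], "reviewer": ["u1"]},
    {"file": ["b.py", "c.py"], "reviewer": ["u2", "u1"]},
]
file2id = build_item2id(events, "file")
user2id = build_item2id(events, "reviewer")
```

Here file2id maps 'a.py', 'b.py' and 'c.py' to 0, 1 and 2, and user2id maps 'u1' and 'u2' to 0 and 1.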
- preprocess(dataset)
- Parameters:
dataset – GerritLoader-like dataset
- Returns:
all necessary events, preprocessed and returned as a data stream
- replace(data, rev)
A method used for simulating history.
- Parameters:
data – a single pull that needs to be modified
rev – reviewer to be added to the pull
- Returns:
pull with a modified reviewer list
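The idea of modifying a pull's reviewer list can be sketched as follows. This is not the batcore implementation: the function add_reviewer and the 'reviewer' key are assumptions, and the description above does not say whether existing reviewers are kept, so this sketch keeps them and appends the new one.

```python
import copy


def add_reviewer(pull, reviewer):
    """Return a copy of the pull with the reviewer added to its reviewer list.

    Hypothetical helper: a pull is assumed to be a dict with a 'reviewer' list.
    """
    updated = copy.deepcopy(pull)  # leave the original pull untouched
    reviewers = updated.setdefault("reviewer", [])
    if reviewer not in reviewers:
        reviewers.append(reviewer)
    return updated
```

Returning a copy instead of mutating the input keeps the recorded history intact while the simulated one is built.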
- to_checkpoint(path)
saves dataset
- Parameters:
path – path to the folder to save results
- class batcore.data.MRLoaderData(path=None, from_date=None, to_date=None, verbose=False, log_file_path=None, log_stdout=False, log_mode='a')
Helper dataset class for Gerrit-based projects. It reads data in the format produced by our download script and outputs it in a convenient format.
- Parameters:
path – path to the folder with the data from the loader tool or to the saved dataset
from_date (datetime.datetime) – all events before from_date are removed from the data
to_date (datetime.datetime) – all events after to_date are removed from the data
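The effect of from_date and to_date can be sketched on a plain list of events. The function filter_by_date and the 'date' key are hypothetical and only mirror the description above: events strictly before from_date or strictly after to_date are dropped, so events exactly on either boundary are kept.

```python
from datetime import datetime


def filter_by_date(events, from_date=None, to_date=None):
    """Keep only events whose 'date' lies within [from_date, to_date].

    Hypothetical helper: events are assumed to be dicts with a 'date'
    field of type datetime.datetime.
    """
    kept = []
    for event in events:
        timestamp = event["date"]
        if from_date is not None and timestamp < from_date:
            continue  # event happened before from_date
        if to_date is not None and timestamp > to_date:
            continue  # event happened after to_date
        kept.append(event)
    return kept


events = [
    {"id": 1, "date": datetime(2020, 1, 1)},
    {"id": 2, "date": datetime(2020, 6, 1)},
    {"id": 3, "date": datetime(2021, 1, 1)},
]
recent = filter_by_date(events, from_date=datetime(2020, 6, 1))
```

Leaving either bound as None disables that side of the filter, matching the None defaults in the constructor signature.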
- static get_df(path)
reads all the csv files from MR-loader
- Parameters:
path – path to the directory with csv files
- Returns:
dictionary with all dataframes for pulls, commits, and comments
- prepare(data)
processes the data
- Parameters:
data – dictionary with all dataframes
- Returns:
pulls and commits dataframes with all mined features
- prepare_pulls()
pull preprocessing
- to_checkpoint(path)
saves dataset
- Parameters:
path – path to the folder to save results