pipeline.modeling package

Submodules

pipeline.modeling.data_to_df module

class pipeline.modeling.data_to_df.LoadDF(config_path, feather_dir='./data/feathered_data')

Bases: object

Converts a dataset made up of multiple CSV files into a single DataFrame.

The DataFrame can be returned directly or saved in the Feather format for fast load times.

We name our compressed dataset using a hash of the features in the dataset. This allows the identical dataset to be reloaded as long as the features haven’t changed.
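The naming scheme described above can be sketched as follows. This is a minimal illustration, not the LoadDF implementation: the helper name feather_path_for, the choice of SHA-256, the truncated digest length, and the sample feature names are all assumptions; only the default feather_dir comes from the class signature.

```python
import hashlib
from pathlib import Path


def feather_path_for(features, feather_dir="./data/feathered_data"):
    """Build a cache file path from a hash of the feature names (hypothetical helper)."""
    # Sort first so the same set of features always yields the same hash,
    # regardless of the order in which the features are listed.
    canonical = "\n".join(sorted(features))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return Path(feather_dir) / f"{digest}.feather"


# Hypothetical feature names, used only for illustration.
same_a = feather_path_for(["hr_mean", "hr_std", "steps"])
same_b = feather_path_for(["steps", "hr_std", "hr_mean"])
changed = feather_path_for(["hr_mean", "hr_std"])

print(same_a == same_b)   # same feature set, so the cached file can be reused
print(same_a == changed)  # different feature set, so a new file name is produced
```

Because the file name depends only on the feature names, a cached file can be looked up before any CSV is read, and changing the feature list automatically points to a different file.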

A dataset configuration file has three parts: feature_sets, feature_files, and features.

  • feature_sets – the types of features to be included

  • feature_files – the paths to the CSV files for each feature set

  • features – the feature names (and CSV column headers) for each feature set
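To make the three parts concrete, here is a hedged sketch of what such a configuration might contain once loaded into Python. The on-disk format, the nesting of the keys, the feature set names, and the file paths are assumptions for illustration; check_config is a hypothetical helper, not part of this package.

```python
# Hypothetical configuration contents; the real file format and layout may differ.
config = {
    "feature_sets": ["vitals", "activity"],
    "feature_files": {
        "vitals": "./data/vitals.csv",
        "activity": "./data/activity.csv",
    },
    "features": {
        "vitals": ["hr_mean", "hr_std"],
        "activity": ["steps"],
    },
}


def check_config(cfg):
    """Return the feature sets that lack a file path or a feature list."""
    missing = []
    for name in cfg["feature_sets"]:
        if name not in cfg["feature_files"] or name not in cfg["features"]:
            missing.append(name)
    return missing


print(check_config(config))
```

A check like this catches a feature set that is listed but has no CSV path or no column names before any file is opened.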

load_all_dataframes(**kw)

pipeline.modeling.data_utils module

class pipeline.modeling.data_utils.TransformDF

Bases: object

apply_rolling_window(**kw)
normalize_dataset(**kw)
sub_sample(**kw)

pipeline.modeling.datasets module

pipeline.modeling.model_defs module

pipeline.modeling.model_monitoring module

class pipeline.modeling.model_monitoring.EarlyStopping(name, mode='min', min_delta=0.001, patience=10, percentage=False)

Bases: object

step(metric, verbose=True)

Compare metric against last time step to determine if training should stop.

Parameters
  • metric (float) – Metric can be a loss value or a model performance metric (e.g. accuracy)

  • verbose (bool) – Print a status message explaining why stopping or continuing is recommended

Returns

Whether training should stop now (True → stop; False → continue)

Return type

bool
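The stopping rule can be illustrated with a small, self-contained sketch. It is not the EarlyStopping source: the class name EarlyStoppingSketch is hypothetical, the name and verbose arguments are omitted, the comparison is made against the best value seen so far (an assumption; the description above speaks of the last time step), and treating min_delta as a percentage of the best value when percentage=True is also an assumption.

```python
class EarlyStoppingSketch:
    """Illustrative stand-in for an early-stopping monitor (hypothetical)."""

    def __init__(self, mode="min", min_delta=0.001, patience=10, percentage=False):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.mode = mode
        self.min_delta = min_delta
        self.patience = patience
        self.percentage = percentage
        self.best = None        # best metric value seen so far
        self.num_bad_steps = 0  # consecutive steps without improvement

    def _is_improvement(self, metric):
        # Assumption: with percentage=True, min_delta is a percent of the best value.
        if self.percentage:
            delta = abs(self.best) * self.min_delta / 100
        else:
            delta = self.min_delta
        if self.mode == "min":
            return metric < self.best - delta
        return metric > self.best + delta

    def step(self, metric):
        """Return True if training should stop, False if it should continue."""
        if self.best is None or self._is_improvement(metric):
            self.best = metric
            self.num_bad_steps = 0
            return False
        self.num_bad_steps += 1
        return self.num_bad_steps >= self.patience


stopper = EarlyStoppingSketch(mode="min", min_delta=0.001, patience=2)
decisions = [stopper.step(loss) for loss in [1.0, 0.8, 0.8, 0.8]]
print(decisions)  # [False, False, False, True]
```

With mode='min' the metric is expected to decrease (a loss); with mode='max' it is expected to increase (e.g. accuracy). In a training loop, step would be called once per epoch and the loop exited as soon as it returns True.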

pipeline.modeling.model_performance module

class pipeline.modeling.model_performance.ModelMetrics(params)

Bases: object

calculate_metrics(labels, preds, probs=None, output_dict=True, summary_stat='macro avg', verbose=False)
graph_model_output(actual_labels, predicted_labels, probabilities=None, max_graph_size=1000, title='Graph Title')
listify_metrics(metrics_dict, loss=0)

Convert metrics from a dictionary to a list

The dictionary of all metrics is converted to a list and the columns are saved in self.metrics_names. The loss is not always used for

Parameters
  • metrics_dict (dict) – dictionary of all performance metrics

  • loss (int, optional) – cumulative loss for a given epoch. Defaults to 0.

Returns

A list of the performance metric values and a DataFrame

Return type

list, DataFrame
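The dictionary-to-list conversion might look roughly like the sketch below. It is not the ModelMetrics code: the function name listify_metrics_sketch is hypothetical, the input is assumed to be a nested dictionary shaped like the output of scikit-learn's classification_report(output_dict=True), the numbers are made up for illustration, and the sketch returns plain lists rather than a DataFrame to stay dependency-free.

```python
def listify_metrics_sketch(metrics_dict, loss=0):
    """Flatten a (possibly nested) metrics dict into parallel name and value lists."""
    names = ["loss"]
    values = [loss]
    for key, value in metrics_dict.items():
        if isinstance(value, dict):
            # Nested entries such as per-class or averaged scores.
            for sub_key, sub_value in value.items():
                names.append(f"{key}_{sub_key}")
                values.append(sub_value)
        else:
            # Scalar entries such as overall accuracy.
            names.append(key)
            values.append(value)
    return names, values


# Illustrative values only, not real results.
report = {
    "0": {"precision": 0.9, "recall": 0.8, "f1-score": 0.85, "support": 10},
    "accuracy": 0.85,
    "macro avg": {"precision": 0.9, "recall": 0.8, "f1-score": 0.85, "support": 10},
}

names, values = listify_metrics_sketch(report, loss=0.42)
print(names)
print(values)
```

Keeping the names in a separate list, as the method does with self.metrics_names, lets the values from each epoch be appended as one row while the column labels stay fixed.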

plot_metrics(metrics, metrics_names, verbose=False)

pipeline.modeling.model_training module

pipeline.modeling.select_features module

pipeline.modeling.select_features.calculateCorr(df, corr_method, threshold)

Supported values for corr_method include ‘pearson’, ‘kendall’, and ‘spearman’.

pipeline.modeling.select_features.get_args() → argparse.ArgumentParser
pipeline.modeling.select_features.intersection(feature_lists)
pipeline.modeling.select_features.main()
pipeline.modeling.select_features.select_by_correlation(feature_csv_path, correlation_method='pearson', threshold=0.7)

Module contents