User Guide

A general introduction

The dummyML package provides the building blocks for automated data analysis pipelines based on commonly used machine learning algorithms: data preprocessing, data splitting, model selection, model fitting, and model evaluation. The sections below summarize the essential components of the package.

Read and summarize the raw data

To read and summarize the raw data, use the functions below. Their help output lists the arguments that control data loading and summarization:

import dummyML.utilities as utilities
help(utilities.read_data)
help(utilities.summarize_data)

Key points

  • The allowed data set formats are limited to csv and sas7bdat.

  • The data set should consist only of the features X and the outcome y.

  • The generated summary report in HTML format can be found in the working_directory/results directory.

Tips

  • By default, only a random sample of 1000 rows from the original data is used to generate the summary. To summarize all rows instead, set max_samples=data.shape[0].
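The subsampling idea behind max_samples can be sketched in a few lines of plain Python. This is only an illustration, not the package's implementation: the helper name sample_rows and its seed argument are hypothetical, and the data is a made-up list of records.

```python
import random


def sample_rows(rows, max_samples=1000, seed=None):
    """Return at most max_samples rows, drawn at random without replacement.

    Hypothetical helper for illustration; dummyML's own sampling may differ.
    """
    if len(rows) <= max_samples:
        return list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, max_samples)


# Made-up data: 5000 records with an id and one feature.
rows = [{"id": i, "x": i * 0.5} for i in range(5000)]

subset = sample_rows(rows, max_samples=1000, seed=0)   # default-style subsample
everything = sample_rows(rows, max_samples=len(rows))  # analogous to max_samples=data.shape[0]
```

Summarizing a subsample keeps report generation fast on large data sets, at the cost of possibly missing rare values.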

Data splitting

To split the data, refer to the following code:

import dummyML.utilities as utilities
help(utilities.split_data)

Key points

  • When test_size is set to 0, K-Fold Cross Validation will be utilized to evaluate the chosen models.

  • When test_size is greater than 0, the specified test size will be employed to evaluate the chosen models. K-Fold Cross Validation can also be utilized in this case.

Tips

  • When the sample size is small, e.g., 100 samples, holding out 20% of the samples as a test set may not be a good idea. You can instead set test_size = 0 and use K-Fold Cross Validation both to select models and to evaluate the selected models.

  • While you are exploring models, we suggest setting random_state to None. When you want to compare models trained in different batches, set random_state to the same number in every batch so that all models see the same split and the comparison is fair.
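The effect of random_state and test_size can be illustrated with a small, self-contained sketch. The function train_test_indices below is hypothetical and is not part of dummyML; it only shows why a fixed seed gives the same split on every run and what a test size of 0 means.

```python
import random


def train_test_indices(n_samples, test_size=0.2, random_state=None):
    """Shuffle the indices 0..n_samples-1 and split off a test fraction.

    Hypothetical helper for illustration. With test_size=0 the test set is
    empty and all samples remain available for cross validation.
    """
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


# Same seed, same split: models trained in different batches see identical data.
train_a, test_a = train_test_indices(100, test_size=0.2, random_state=42)
train_b, test_b = train_test_indices(100, test_size=0.2, random_state=42)

# test_size=0: nothing is held out.
train_all, test_none = train_test_indices(100, test_size=0)
```

With random_state=None the shuffle is seeded from the system, so each run produces a different split, which is useful while exploring.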

Data preprocessing

To control data preprocessing, use the script below and adjust the arguments as needed:

from dummyML.preprocessing import data_preprocessing
help(data_preprocessing)

Key points

  • Samples with missing values at the outcome will be removed.

  • Features with fewer unique values than a certain threshold, such as 15, will be treated as categorical variables.

  • Columns that contain strings and have over 15 unique values will be treated as text data and dropped.

  • Samples and features whose proportion of missing values exceeds a certain threshold, such as 0.5, will be dropped.

  • Missing values in categorical variables will be treated as a separate level.

  • Two coding methods for categorical variables will be used: dummy coding and ordinal coding.

  • Median imputation will be used to address missing values in quantitative variables.

  • Variables with a single unique value will be dropped.

  • The data preprocessing steps will be saved and can be utilized for applying the same preprocessing steps on future test sets.

Tips

  • When for_future_test is set to False, the saved preprocessing steps can still be applied to raw future data. However, if there is an unknown class in the explanatory variables (or features), applying the saved preprocessing steps will result in an error.

  • When the focus is on prediction, setting for_future_test to True will not affect the prediction performance. However, if the features need to be explained, it is better to set for_future_test to False.

  • If the variables with few unique values are ordinal, one-hot encoding is not necessary. In this situation, cat_levels_threshold can be set to 2, so that only variables with two unique values and variables that are already categorical will be coded as categorical.

  • For tree-based models, ordinal coding can be used. For linear methods, ordinal coding is not recommended.
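Several of the steps above (median imputation, treating missing categorical values as a separate level, and the two coding schemes) can be sketched in plain Python. All function names here are hypothetical and the data is made up; the sketch is meant to show the ideas, not how dummyML implements them. In particular, this version creates one indicator column per level, whereas a package may drop a reference level.

```python
from statistics import median


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def fill_missing_level(values, label="missing"):
    """Treat missing categorical values as their own level."""
    return [label if v is None else v for v in values]


def ordinal_code(values):
    """Map each level to an integer (levels sorted alphabetically)."""
    levels = sorted(set(values))
    mapping = {level: i for i, level in enumerate(levels)}
    return [mapping[v] for v in values], levels


def dummy_code(values):
    """Create one 0/1 indicator column per level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return rows, levels


age = [34, None, 51, 29, None, 42]
color = ["red", None, "blue", "red", "green", None]

age_filled = impute_median(age)
color_filled = fill_missing_level(color)
color_ordinal, levels = ordinal_code(color_filled)
color_dummy, _ = dummy_code(color_filled)
```

Ordinal coding imposes an arbitrary order on the levels. Tree-based models can split around that order, but a linear model would read it as a numeric trend, which is why indicator columns are usually preferred for linear methods.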

Select and fit the models

To control the behavior of automated model selection and fitting, you can use the following script:

from dummyML.automate_modeling_evaluation import automate_modeling
help(automate_modeling)

Key points

  • The type of modeling task, such as regression, binary or multiclass classification, is determined based on the data type and unique values of the outcome variable y.

  • The following models are available for all three types of tasks: standard linear model (linear), linear model with lasso penalty (lasso), linear model with ridge penalty (ridge), linear model with ElasticNet penalty (elasticNet), support vector machine (svm), neural network (nn), gradient boosting (gb), and random forest (rf).

  • For highly imbalanced data, where the majority category is over 10 times the minority category, ensemble-based imbalanced learning models such as balanced random forest model, random under-sampling integrated in the learning of AdaBoost, and bag of balanced boosted learners are used for modeling the data.

  • For lasso, ridge, and elasticNet, model selection is performed using the model selection procedures provided by sklearn.

  • Bayesian optimization is used for model selection of svm, nn, gb, and rf. The evaluation is based on either K-Fold CV or the performance on the validation set.
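As a rough illustration of the first and third points, the sketch below guesses the task type from the outcome values and checks the 10-to-1 imbalance rule. The function names and the max_classes cutoff are assumptions made for this example; the rules actually used by dummyML may differ.

```python
from collections import Counter


def infer_task(y, max_classes=15):
    """Guess the modeling task from the outcome values.

    max_classes is a hypothetical cutoff: a numeric outcome with more
    distinct values than this is treated as a regression target.
    """
    observed = [v for v in y if v is not None]
    levels = set(observed)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    if numeric and len(levels) > max_classes:
        return "regression"
    return "binary" if len(levels) == 2 else "multiclass"


def is_highly_imbalanced(y, ratio=10):
    """True when the largest class is more than `ratio` times the smallest."""
    counts = Counter(y)
    return max(counts.values()) > ratio * min(counts.values())


task_binary = infer_task([0, 1, 1, 0])
task_multi = infer_task(["a", "b", "c", "a"])
task_reg = infer_task([i * 0.1 for i in range(100)])

skewed = is_highly_imbalanced([0] * 95 + [1] * 5)     # 95 vs 5
balanced = is_highly_imbalanced([0] * 60 + [1] * 40)  # 60 vs 40
```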

Tips

  • Support vector machine and neural network are computationally intensive for large sample sizes.

  • For small sample sizes (fewer than 100), Leave-One-Out Cross Validation (LOOCV) is recommended for evaluating candidates during model selection. To use it, set cv to the number of samples in the training data.

  • In the model selection of gradient boosting, random forest, and neural network, shortcuts are used to save time. If you want to use K-Fold CV for model selection, set cv_force to True.

  • For very large sample sizes, it is recommended to use ridge regression or lasso instead of elasticNet, as the search space of elasticNet is quite large.

  • Do not set cv_force to True for large sample sizes, because running full K-Fold CV for these models is slow.
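The relationship between K-Fold CV and LOOCV mentioned above can be seen in a short sketch: LOOCV is simply K-Fold CV with as many folds as samples. The generator k_fold_indices is a hypothetical helper written for this example. It uses contiguous folds without shuffling, while real pipelines usually shuffle first.

```python
def k_fold_indices(n_samples, cv):
    """Yield (train, validation) index lists for cv contiguous folds.

    Hypothetical helper for illustration. Fold sizes differ by at most one.
    """
    if not 2 <= cv <= n_samples:
        raise ValueError("cv must be between 2 and n_samples")
    base, extra = divmod(n_samples, cv)
    start = 0
    for fold in range(cv):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# 3-fold CV on 10 samples: validation folds of size 4, 3 and 3.
three_fold = list(k_fold_indices(10, cv=3))

# LOOCV: cv equals the number of samples, so each fold holds out one sample.
loocv = list(k_fold_indices(10, cv=10))
```

Because LOOCV fits one model per sample, its cost grows linearly with the sample size, which is why it is only practical for small data sets.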

Evaluate the selected models using multiple metrics

Key points

  • The chosen model can be assessed using the test set, K-Fold CV, or both.

  • For classification problems, the following metrics are used: sensitivity, specificity, balanced accuracy, recall, precision, F1 score, AUC, and positive and negative predictive values.

  • For regression problems, the following metrics are used: R-squared, mean squared error (MSE), and mean absolute error (MAE).

  • When dealing with imbalanced binary outcomes, where the majority class dominates the minority class, the above classification metrics may not be appropriate at the default cutoff. To obtain a fair assessment of the models, the classification cutoff can be adjusted. For more details, refer to the semi-automatic data analysis pipeline.

Tips

  • When the sample size is small, relying solely on the test set to evaluate model performance can be misleading. In such cases, it is advisable to use K-Fold CV in addition to the test set.
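To make the point about cutoffs concrete, the sketch below computes several of the classification metrics listed above from predicted probabilities at a given cutoff. The function names and the example data are invented for illustration, and this is not the package's evaluation code.

```python
def confusion_counts(y_true, y_prob, cutoff=0.5):
    """Count true/false positives and negatives at a probability cutoff."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= cutoff else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_prob, cutoff=0.5):
    """Return a dict of common binary classification metrics."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, cutoff)

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)  # also called recall
    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "precision": ratio(tp, tp + fp),  # positive predictive value
        "npv": ratio(tn, tn + fn),        # negative predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up imbalanced outcome: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.45, 0.30, 0.40, 0.20, 0.10, 0.15, 0.05, 0.25, 0.35, 0.12]

default = classification_metrics(y_true, y_prob, cutoff=0.5)
lowered = classification_metrics(y_true, y_prob, cutoff=0.28)
```

In this example the default cutoff of 0.5 predicts every sample as negative, so sensitivity is 0 even though specificity is perfect. Lowering the cutoff to 0.28 recovers both positives at the price of two false positives.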

Saved data report, preprocessed data and saved models

The generated data summarization report, preprocessed data sets, saved data preprocessing steps, saved models and saved evaluation metrics are in the results folder of the current working directory.