predict_ml()

datasafari.predictor.predict_ml(
df: DataFrame,
x_cols: List[str] | None = None,
y_col: str | None = None,
formula: str | None = None,
data_state: str = 'unprocessed',
n_top_models: int = 3,
test_size: float = 0.2,
cv: int = 5,
random_state: int = 42,
priority_metrics: List[str] = [],
refit_metric: str | None = None,
priority_tuners: List[str] | None = None,
custom_param_grids: dict | None = None,
n_jobs: int = -1,
n_iter_random: int | None = None,
n_iter_bayesian: int | None = None,
priority_models: List[str] | None = None,
model_kwargs: dict | None = None,
verbose: int = 1,
numeric_imputer: TransformerMixin = SimpleImputer(strategy='median'),
numeric_scaler: TransformerMixin = StandardScaler(),
categorical_imputer: TransformerMixin = SimpleImputer(fill_value='missing', strategy='constant'),
categorical_encoder: TransformerMixin = OneHotEncoder(handle_unknown='ignore'),
text_vectorizer: TransformerMixin = CountVectorizer(),
datetime_transformer: Callable[[DataFrame], DataFrame] | None = None,
) -> Dict[str, Any]

Streamline the entire process of data preprocessing, model selection, and tuning, delivering optimal model recommendations based on the data provided.

Depending on the inputs, this function can either perform statistical inference or predictive model selection using machine learning.
  • Machine Learning Pipeline: Focuses on predictive model selection and hyperparameter tuning using scikit-learn. It includes preprocessing (optional), model recommendation based on specified metrics, and tuning using grid search, random search, or Bayesian optimization.

  • Inference Pipeline: Utilizes statsmodels for detailed statistical analysis and model fitting based on a specified formula. This pipeline is tailored for users seeking statistical inference, providing metrics such as AIC, BIC, and R-squared. It assumes the data have already been appropriately preprocessed.

Parameters:

df : pd.DataFrame

The DataFrame containing the dataset to be analyzed.

x_cols : List[str], optional, default: None

List of column names to be used as features for machine learning model recommendation.

y_col : str, optional, default: None

Column name to be used as the target for machine learning model recommendation.

formula : str, optional, default: None

A Patsy formula for specifying the model in the case of statistical inference.

data_state : str, optional, default: 'unprocessed'

Specifies the initial state of the data ('unprocessed' or 'preprocessed').

  • 'unprocessed' will trigger the customizable preprocessing procedure.

  • 'preprocessed' will omit the preprocessing procedure. Only suitable for preprocessed data!
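For example, if the dataset has already been imputed, scaled, and encoded elsewhere, preprocessing can be skipped. A minimal sketch (df_preprocessed and its column names are hypothetical):

>>> ml_models_preprocessed = predict_ml(
...     df_preprocessed,
...     x_cols=['age_scaled', 'salary_scaled', 'department_encoded'],
...     y_col='target',
...     data_state='preprocessed'
... )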

n_top_models : int, optional, default: 3

Number of top models to recommend from the evaluation.

test_size : float, optional, default: 0.2

Proportion of the dataset to be used as the test set.

cv : int, optional, default: 5

Number of cross-validation folds.

random_state : int, optional, default: 42

Controls the shuffling applied to the data before applying the split.

priority_metrics : List[str], optional, default: []

Metrics to prioritize in model evaluation in the machine learning pipeline. Note: The list members must be in the correct format as specified below.

Available Metrics:

  • Regression: 'explained_variance', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_root_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'r2', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'neg_mean_absolute_percentage_error'

  • Classification: 'accuracy', 'balanced_accuracy', 'average_precision', 'neg_brier_score', 'f1_micro', 'f1_macro', 'f1_weighted', 'neg_log_loss', 'precision_micro', 'precision_macro', 'precision_weighted', 'recall_micro', 'recall_macro', 'recall_weighted', 'jaccard_micro', 'jaccard_macro', 'jaccard_weighted', 'roc_auc_ovr', 'roc_auc_ovo'
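These strings are standard scikit-learn scorer names. If unsure whether a name is valid in your installed scikit-learn version, the scorer registry can be checked directly (a small sketch; assumes scikit-learn >= 1.0):

>>> from sklearn.metrics import get_scorer_names
>>> 'neg_mean_squared_error' in get_scorer_names()
True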

refit_metric : str, optional, default: None

Metric to use for refitting the models in the machine learning pipeline. Note: The string must be in the correct format as specified below.

  • If None, the function will use the first member of priority_metrics.

  • If None and no priority_metrics are provided, the function defaults to 'Accuracy' for classification models and 'MSE' for regression models.

    Available Refit Metrics:

    • Regression: 'EV', 'MAE', 'MSE', 'RMSE', 'MSLE', 'MedAE', 'R2', 'MPD', 'MGD', 'MAPE'

    • Classification: 'Accuracy', 'Balanced Accuracy', 'Average Precision', 'Neg Brier Score', 'F1 (Micro)', 'F1 (Macro)', 'F1 (Weighted)', 'Neg Log Loss', 'Precision (Micro)', 'Precision (Macro)', 'Precision (Weighted)', 'Recall (Micro)', 'Recall (Macro)', 'Recall (Weighted)', 'Jaccard (Micro)', 'Jaccard (Macro)', 'Jaccard (Weighted)', 'ROC AUC (OVR)', 'ROC AUC (OVO)'

priority_tuners : List[str], optional, default: None

Tuners to use for hyperparameter tuning in the machine learning pipeline. Note: The list members must be in the correct format as specified below.

Available Tuners: 'grid', 'random', 'bayesian'

custom_param_grids : dict, optional, default: None

Custom parameter grids for tuning in the machine learning pipeline. Note: Template dictionaries are provided at the end of this page.

n_jobs : int, optional, default: -1

Number of jobs to run in parallel. -1 means using all available processors.

n_iter_random : int, optional, default: None

Number of iterations for random search tuning in the machine learning pipeline.

n_iter_bayesian : int, optional, default: None

Number of iterations for Bayesian optimization in the machine learning pipeline.

priority_models : List[str], optional, default: None

Specific models to evaluate in the inference pipeline. Note: The list members must be in the correct format as specified below.

  • If None, the function will assess all appropriate models.

    Available Inferential Models:

    • Regression: 'OLS', 'WLS', 'GLS', 'RLM', 'QuantReg', 'GLSAR', 'MixedLM', 'PHReg'

    • Classification: 'Logit', 'Probit', 'MNLogit', 'Poisson', 'NegativeBinomial', 'GEE', 'NominalGEE', 'OrdinalGEE'

model_kwargs : dict, optional, default: None

Keyword arguments to pass to model constructors in the inference pipeline.

verbose : int, optional, default: 1

Level of verbosity in output.

numeric_imputer : TransformerMixin, optional, default: SimpleImputer(strategy='median')

Imputer for handling missing values in numerical data, if data_state='unprocessed'.

Any imputer from sklearn.impute can be used instead of the default.

numeric_scaler : TransformerMixin, optional, default: StandardScaler()

Scaler for numerical data, if data_state='unprocessed'.

Any scaler from sklearn.preprocessing can be used instead of the default.

categorical_imputer : TransformerMixin, optional, default: SimpleImputer(strategy='constant', fill_value='missing')

Imputer for handling missing values in categorical data, if data_state='unprocessed'.

Any imputer from sklearn.impute can be used instead of the default.

categorical_encoder : TransformerMixin, optional, default: OneHotEncoder(handle_unknown='ignore')

Encoder for categorical data, if data_state='unprocessed'.

Any encoder from sklearn.preprocessing can be used instead of the default.

text_vectorizer : TransformerMixin, optional, default: CountVectorizer()

Vectorizer for text data, if data_state='unprocessed'.

Any vectorizer from sklearn.feature_extraction.text can be used instead of the default.
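Taken together, the preprocessing parameters accept any compatible scikit-learn transformers. A hedged sketch of swapping in alternatives to the defaults, reusing df, x_cols, and y_col from the Examples section below (the specific transformer choices here are illustrative, not required):

>>> from sklearn.impute import KNNImputer, SimpleImputer
>>> from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> ml_models_custom_prep = predict_ml(
...     df,
...     x_cols=x_cols,
...     y_col=y_col,
...     numeric_imputer=KNNImputer(n_neighbors=5),
...     numeric_scaler=MinMaxScaler(),
...     categorical_imputer=SimpleImputer(strategy='most_frequent'),
...     categorical_encoder=OneHotEncoder(handle_unknown='ignore'),
...     text_vectorizer=TfidfVectorizer(max_features=500)
... )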

datetime_transformer : callable, optional, default: None

Transformer for datetime data, if data_state='unprocessed'.

Note: When left as None, a built-in datetime transformer is applied, which extracts year, month, and day as separate features. This is an experimental feature, and using custom solutions is not recommended.
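If a custom transformer is nevertheless passed despite this recommendation, it should be a callable mapping a DataFrame of datetime columns to a DataFrame of numeric features. A minimal sketch of such a callable (the function name is illustrative):

>>> import pandas as pd
>>> def extract_date_parts(dt_df: pd.DataFrame) -> pd.DataFrame:
...     # Expand each datetime column into separate year, month, and day features.
...     parts = {}
...     for col in dt_df.columns:
...         parts[f'{col}_year'] = dt_df[col].dt.year
...         parts[f'{col}_month'] = dt_df[col].dt.month
...         parts[f'{col}_day'] = dt_df[col].dt.day
...     return pd.DataFrame(parts, index=dt_df.index)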

Returns:

Dict[str, Any]
Depending on the operation mode, the dictionary contains either:
  • top machine learning models and their evaluation metrics,

  • top statistical models along with their fit statistics.

Examples:

Import necessary libraries and generate a DataFrame for examples:

>>> from datasafari.predictor import predict_ml
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
...     'Age': np.random.randint(18, 35, size=100),
...     'Salary': np.random.normal(50000, 12000, size=100),
...     'Department': np.random.choice(['HR', 'Tech', 'Marketing'], size=100),
...     'Review': ['Good review']*50 + ['Bad review']*50,
...     'Employment Date': pd.date_range(start='2010-01-01', periods=100, freq='M')
... })

Machine Learning Pipeline Examples

Simple Machine Learning Pipeline:

>>> x_cols = ['Age', 'Department', 'Review', 'Employment Date']
>>> y_col = 'Salary'
>>> ml_models = predict_ml(df, x_cols=x_cols, y_col=y_col, verbose=2)

Utilizing Priority Metrics and Refit Metric in Machine Learning Pipeline:

>>> priority_metrics = ['neg_mean_squared_error', 'r2']
>>> ml_models_priority_metrics = predict_ml(
...     df,
...     x_cols=x_cols,
...     y_col=y_col,
...     priority_metrics=priority_metrics,
...     refit_metric='r2',
...     verbose=2
... )

Integrating Priority Tuners with Custom Parameter Grids:

>>> custom_grids = {
...     'RandomForestRegressor': {
...         'n_estimators': [100, 200],
...         'max_depth': [None, 10, 20]
...     }
... }
>>> priority_tuners = ['random', 'grid']
>>> ml_models_with_custom_tuning = predict_ml(
...     df,
...     x_cols=x_cols,
...     y_col=y_col,
...     priority_metrics=priority_metrics,
...     refit_metric='r2',
...     priority_tuners=priority_tuners,
...     custom_param_grids=custom_grids,
...     verbose=2
... )

Advanced Machine Learning Pipeline Using Bayesian Optimization:

>>> priority_tuners = ['bayesian']
>>> n_iter_bayesian = 50
>>> ml_models_bayesian = predict_ml(
...     df,
...     x_cols=x_cols,
...     y_col=y_col,
...     priority_metrics=priority_metrics,
...     refit_metric='r2',
...     priority_tuners=priority_tuners,
...     n_iter_bayesian=n_iter_bayesian,
...     verbose=2
... )

Inference Pipeline Examples

Simple Inference Example Using a Statistical Model:

Find the relationship between Salary and Age using Ordinary Least Squares (OLS) regression:

>>> formula = 'Salary ~ Age'
>>> inference_result_ols = predict_ml(df, formula=formula, verbose=2)

Using a Categorical Predictor with OLS:

Incorporate a categorical variable (Department) in the model to examine its effect on Salary:

>>> formula = 'Salary ~ Age + C(Department)'
>>> inference_result_ols_categorical = predict_ml(df, formula=formula, verbose=2)

Advanced Inference with Multiple Models and Specific Metrics:

Compare multiple regression models focusing on their fit statistics:

>>> priority_models = ['OLS', 'WLS', 'GLS']
>>> advanced_inference_models = predict_ml(
...     df,
...     formula=formula,
...     priority_models=priority_models,
...     verbose=2
... )

Inference with Robust Regression Model:

Utilize Robust Linear Models (RLM) to mitigate the influence of outliers:

>>> formula = 'Salary ~ Age + C(Department)'
>>> robust_inference_result = predict_ml(
...     df,
...     formula=formula,
...     priority_models=['RLM'],
...     verbose=2
... )

Mixed Linear Model for Hierarchical or Longitudinal Data:

Apply a Mixed Linear Model (MixedLM) if the data structure involves nested or grouped observations:

>>> formula = 'Salary ~ Age + C(Department) + (1|Employment Date)'
>>> mixedlm_inference_result = predict_ml(
...     df,
...     formula=formula,
...     priority_models=['MixedLM'],
...     verbose=2
... )

Notes:

Pipelines Explained

Machine Learning Pipeline

  1. Data Preprocessing (optional): prepares a dataset for machine learning by handling numerical, categorical, text, and datetime data.
    • It supports flexible imputation, scaling, encoding, and vectorization methods to cater to a wide range of preprocessing needs.

    • The function automatically splits the data into training and test sets and applies the preprocessing steps defined by the user.

    • It accommodates custom preprocessing steps for various data types, enhancing flexibility and control over the preprocessing pipeline.

  2. Evaluation of Untuned Models: leverages a composite score for model evaluation, which synthesizes scores across multiple metrics, weighted by the specified priorities. This method enables a holistic and nuanced model comparison, taking into account the multidimensional aspects of model performance.
    • Priority Metrics: Assigning weights (default: 5 for prioritized metrics, 1 for others) allows users to emphasize metrics they find most relevant, affecting the composite score calculation.

    • Composite Score: Calculated as a weighted average of metric scores, normalized by the total weight. This score serves as a basis for ranking models (a minimal sketch of the calculation follows this list). The formula for the composite score is given by:

    \[C = \frac{\sum_{m \in M} (w_m \cdot \text{adj}(s_m))}{\sum_{m \in M} w_m}\]
    Where:
    • \(\text{adj}(s_m)\) is the score adjustment function, ensuring a consistent interpretation across metrics. Metrics for which lower values are traditionally better (e.g., RMSE, MAE) are inverted or negated prior to weight application, aligning all metrics to the “higher is better” principle for score calculation.

    • \(w_m\) represents the weight of metric \(m\).

    • \(M\) is the set of all metrics considered in the evaluation.

  3. Model Tuning: takes the top N untuned models and tunes them, systematically applying grid search, random search, or Bayesian optimization to explore their hyperparameter space. It supports customization of the tuning process through various parameters and outputs the best configurations found.
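A minimal sketch of the composite-score calculation described in step 2 (the metric names, weights, and adjusted scores below are illustrative, not taken from an actual run):

>>> # Weights: 5 for prioritized metrics, 1 for the rest (the documented default).
>>> weights = {'neg_mean_squared_error': 5, 'r2': 5, 'neg_mean_absolute_error': 1}
>>> # adj(s_m): scores already adjusted so that higher is better for every metric.
>>> adjusted_scores = {'neg_mean_squared_error': 0.82, 'r2': 0.76, 'neg_mean_absolute_error': 0.69}
>>> composite = (
...     sum(weights[m] * adjusted_scores[m] for m in weights)
...     / sum(weights.values())
... )
>>> round(composite, 3)
0.781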

Statistical Inference Pipeline

  1. Determination of Task Type: First, the function identifies whether the analysis involves regression or classification. This categorization is based on the datatype of the target variable specified in the formula:
    • Regression: Applied if the target variable is numerical.

    • Classification: Applied if the target variable is categorical.

  2. Model Selection: Based on the task type determined in the previous step, the function selects from a pre-defined set of models suitable for either regression or classification:
    • Models and their respective functions are predefined in the models_classification_inference or models_regression_inference dictionaries, depending on whether the task is classification or regression.

    • The user has the option to limit the evaluation to a subset of models through the priority_models parameter, enhancing focus and computational efficiency.

  3. Model Evaluation: Each selected model is fitted to the data using the formula provided:
    • The function iterates over each model, passing any user-defined keyword arguments specific to that model using the model_kwargs dictionary. This allows for customized model configurations.

    • Models are fitted using their respective statistical functions from the statsmodels API, adhering to the specifications in the formula.

  4. Metrics Calculation: After fitting, the function evaluates each model using a set of predefined metrics appropriate for the task type:
    • Certain metrics, particularly those for which a lower value indicates better performance (e.g., AIC, BIC), are adjusted to fit a common scoring scheme where higher values indicate better model performance.

  5. Model Ranking and Output: Finally, the models are ranked based on their performance metrics:
    • A sorted list of models is generated based on the adjusted metrics, allowing the top-performing models to be identified.

    • The function returns the top n_top_models as specified, including their fitted model objects and performance metrics, facilitating further analysis or validation by the user.

    • If verbose output is enabled, the function provides detailed summaries of the top models, aiding in interpretive and diagnostic processes.
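For orientation, a hedged sketch of the general pattern this pipeline follows for a single model, using statsmodels' formula API and reusing df from the Examples section (this is not the function's internal code):

>>> import statsmodels.formula.api as smf
>>> ols_fit = smf.ols('Salary ~ Age + C(Department)', data=df).fit()
>>> fit_stats = (ols_fit.aic, ols_fit.bic, ols_fit.rsquared)  # the kind of fit statistics used for ranking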

Available Metadata

Below you can find all of the models, scoring metrics, and tuners predict_ml() is equipped with. We also provide default parameter grids for users who do not wish to provide their own.

ML-oriented Models

Classification Models
  • LogisticRegression: Provides logistic regression for binary classification.

  • DecisionTreeClassifier: Offers decision tree algorithms for classification.

  • RandomForestClassifier: Implements a random forest for classification.

  • GradientBoostingClassifier: Applies gradient boosting techniques for classification.

  • SVC: Support Vector Classifier with enabled probability estimates.

  • KNeighborsClassifier: Utilizes k-nearest neighbors voting classification.

Regression Models
  • LinearRegression: Ordinary least squares Linear Regression.

  • Ridge: Ridge regression with L2 regularization.

  • Lasso: Lasso regression with L1 regularization.

  • DecisionTreeRegressor: Regression based on decision trees.

  • RandomForestRegressor: Random forest algorithm for regression.

  • GradientBoostingRegressor: Gradient boosting for regression.

  • SVR: Epsilon-Support Vector Regression.

  • KNeighborsRegressor: Regression based on k-nearest neighbors.

Inference-oriented Models

These models are specifically used for statistical inference, allowing for detailed statistical analysis.

Classification Inference Models
  • Logit: Logistic regression for binary classification.

  • Probit: Probit model for binary classification.

  • MNLogit: Multinomial logistic regression for handling multiple categories.

  • Poisson: Poisson model for count data.

  • NegativeBinomial: Negative binomial model for count data with over-dispersion.

  • GEE: Generalized Estimating Equations for longitudinal data.

  • NominalGEE: Generalized Estimating Equations for nominal responses.

  • OrdinalGEE: Generalized Estimating Equations for ordinal responses.

Regression Inference Models
  • OLS: Ordinary Least Squares for linear regression.

  • WLS: Weighted Least Squares for cases with non-constant variance.

  • GLS: Generalized Least Squares for regression with correlated errors.

  • RLM: Robust Linear Models for regression with outliers.

  • QuantReg: Quantile Regression for modeling different quantiles.

  • GLSAR: GLS with autoregressive error model.

  • MixedLM: Mixed Linear Model for hierarchical or longitudinal data.

  • PHReg: Proportional Hazards model for survival analysis.

Scoring Metrics

Classification Scoring Metrics
  • Accuracy, Balanced Accuracy: Measures overall and balanced accuracy.

  • Average Precision, F1 Score Variants: Assesses precision-recall balance.

  • Negative Log Loss: Negative log-likelihood of the classifier.

  • Precision, Recall, Jaccard Index: Evaluate how well positive samples are identified.

  • ROC AUC: Area Under the ROC Curve for model discrimination capability.

Regression Scoring Metrics
  • Explained Variance, MAE, MSE, RMSE, MSLE, MedAE: Measures of error and variance explained by the model.

  • R2, MPD, MGD, MAPE: Metrics for accuracy and prediction deviation.

Model Tuners

  • GridSearchCV: Exhaustive search over specified parameter values.

  • RandomizedSearchCV: Randomized search over hyperparameters.

  • BayesSearchCV: Bayesian approach to hyperparameter optimization.
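For reference, a hedged sketch of how these tuners are typically constructed with scikit-learn and scikit-optimize (a generic pattern with an illustrative estimator and grid, not the function's internal code):

>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
>>> from skopt import BayesSearchCV  # requires scikit-optimize
>>> param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10, 20]}
>>> grid_tuner = GridSearchCV(RandomForestRegressor(), param_grid, cv=5, scoring='r2', n_jobs=-1)
>>> random_tuner = RandomizedSearchCV(RandomForestRegressor(), param_grid, n_iter=6, cv=5, scoring='r2', n_jobs=-1)
>>> bayes_tuner = BayesSearchCV(RandomForestRegressor(), param_grid, n_iter=6, cv=5, scoring='r2', n_jobs=-1)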

Default Parameter Grids

We provide default parameter grids for users who do not wish to provide their own. Feel free to use the ones below as a template for your own grids to pass via predict_ml(custom_param_grids=...).

Parameter Grid for Classification Model Tuning

>>> default_param_grids_classification = {
...     'LogisticRegression': {
...         'C': [0.1, 1, 10, 100],
...         'penalty': ['l2'],
...         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
...     },
...     'DecisionTreeClassifier': {
...         'max_depth': [None, 10, 20, 30, 40, 50],
...         'min_samples_split': [2, 5, 10],
...         'min_samples_leaf': [1, 2, 4]
...     },
...     'RandomForestClassifier': {
...         'n_estimators': [100, 200, 300, 400],
...         'max_features': ['auto', 'sqrt'],
...         'max_depth': [None, 10, 20, 30, 40],
...         'min_samples_split': [2, 5, 10],
...         'min_samples_leaf': [1, 2, 4]
...     },
...     'GradientBoostingClassifier': {
...         'n_estimators': [100, 200, 300],
...         'learning_rate': [0.01, 0.1, 0.2, 0.5],
...         'max_depth': [3, 5, 7, 9]
...     },
...     'SVC': {
...         'C': [0.1, 1, 10, 100, 1000],
...         'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
...         'gamma': ['scale', 'auto']
...     },
...     'KNeighborsClassifier': {
...         'n_neighbors': [3, 5, 7, 9],
...         'weights': ['uniform', 'distance'],
...         'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
...     }
... }

Parameter Grid for Regression Model Tuning

>>> default_param_grids_regression = {
...     'LinearRegression': {
...         # Linear Regression usually does not need hyperparameter tuning except for regularization
...     },
...     'Ridge': {
...         'alpha': [0.1, 1.0, 10.0, 100.0],
...         'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
...     },
...     'Lasso': {
...         'alpha': [0.1, 1.0, 10.0, 100.0],
...         'selection': ['cyclic', 'random']
...     },
...     'DecisionTreeRegressor': {
...         'max_depth': [None, 10, 20, 30, 40, 50],
...         'min_samples_split': [2, 5, 10],
...         'min_samples_leaf': [1, 2, 4]
...     },
...     'RandomForestRegressor': {
...         'n_estimators': [100, 200, 300, 400],
...         'max_features': ['auto', 'sqrt'],
...         'max_depth': [None, 10, 20, 30, 40],
...         'min_samples_split': [2, 5, 10],
...         'min_samples_leaf': [1, 2, 4]
...     },
...     'GradientBoostingRegressor': {
...         'n_estimators': [100, 200, 300],
...         'learning_rate': [0.01, 0.1, 0.2, 0.5],
...         'max_depth': [3, 5, 7, 9]
...     },
...     'SVR': {
...         'C': [0.1, 1, 10, 100, 1000],
...         'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
...         'gamma': ['scale', 'auto']
...     },
...     'KNeighborsRegressor': {
...         'n_neighbors': [3, 5, 7, 9],
...         'weights': ['uniform', 'distance'],
...         'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
...     }
... }