explore_num()

datasafari.explorer.explore_num(
df: DataFrame,
numerical_variables: List[str],
method: str = 'all',
output: str = 'print',
threshold_z: int = 3,
) -> Tuple[Dict, DataFrame] | None

Explore numerical variables in a DataFrame: distribution characteristics (skewness, kurtosis, normality tests), outlier detection using three methods (Z-score, IQR, Mahalanobis distance), correlation analysis, and multicollinearity detection.

Parameters:

df : pd.DataFrame

The DataFrame containing the numerical data to analyze.

numerical_variables : list

A list of strings representing the column names in df to be analyzed.

method : str, optional, default: 'all'
Specifies the analysis method to apply.
  • 'correlation_analysis' for analyzing the correlation between numerical variables.

  • 'distribution_analysis' for distribution characteristics, including skewness and kurtosis, and normality tests (Shapiro-Wilk, Anderson-Darling).

  • 'outliers_zscore' for outlier detection using the Z-score method.

  • 'outliers_iqr' for outlier detection using the Interquartile Range method.

  • 'outliers_mahalanobis' for outlier detection using the Mahalanobis distance.

  • 'multicollinearity' for detecting multicollinearity among the numerical variables.

  • 'all' to perform all available analyses.

output : str, optional, default: 'print'
Determines the output format.
  • 'print' to print the analysis results to the console.

  • 'return' to return the analysis results as a DataFrame or dictionaries, depending on the analysis type.

threshold_z : int, optional, default: 3

The z-score threshold used when method is 'outliers_zscore'. Set a different value if the default does not fit your needs.

Returns:

Tuple[Dict, pd.DataFrame] or None.
  • Tuple[Dict, pd.DataFrame]: For 'correlation_analysis', returns a DataFrame showing the correlation coefficients between variables if output is 'return'.

  • Tuple[Dict, pd.DataFrame]: For 'distribution_analysis', returns a DataFrame with distribution statistics if output is 'return'.

  • Tuple[Dict, pd.DataFrame]: For outlier detection methods ('outliers_zscore', 'outliers_iqr', 'outliers_mahalanobis'), returns a dictionary mapping variables to their outlier values and a DataFrame of rows considered outliers if output is 'return'.

  • Tuple[Dict, pd.DataFrame]: For 'multicollinearity', returns a DataFrame or a Series indicating the presence of multicollinearity, such as VIF scores, if output is 'return'.

  • Tuple[Dict, pd.DataFrame]: If output='return' and method='all', returns a comprehensive summary of all analyses as text or a combination of DataFrames and dictionaries.

  • None: If output='print' and method='all', returns nothing, but prints results to the console.

Raises:

TypeError:
  • If df is not a pandas DataFrame.

  • If numerical_variables is not a list of strings.

  • If method is not a string.

  • If output is not a string.

  • If threshold_z is not a float or an int.

ValueError:
  • If df is empty.

  • If method is not one of the specified valid methods.

  • If output is not ‘print’ or ‘return’.

  • If numerical_variables is an empty list.

  • If any column listed in numerical_variables is not numerical.

  • If any specified variables in numerical_variables are not found in the DataFrame’s columns.

Examples:

Generating a sample DataFrame to demonstrate the functionality:

>>> from datasafari.explorer import explore_num
>>> import numpy as np
>>> import pandas as pd
>>> data = {
...    'Feature1': np.random.normal(loc=0, scale=1, size=100),
...    'Feature2': np.random.exponential(scale=2, size=100),
...    'Feature3': np.random.randint(low=1, high=100, size=100)
... }
>>> df = pd.DataFrame(data)

With the default method='all', passing a DataFrame and the numerical columns to explore runs every available analysis:

>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'])

Performing correlation analysis and printing the results:

>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='correlation_analysis', output='print')

Conducting distribution analysis and returning the results:

>>> distribution_results = explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='distribution_analysis', output='return')
>>> print(distribution_results)
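
To make the distribution statistics concrete, the sketch below computes moment-based skewness and excess kurtosis using only the Python standard library. It illustrates the quantities involved and is not datasafari's implementation: the helper name `skewness_and_excess_kurtosis` and the sample lists are hypothetical, and the library may apply bias corrections that this sketch omits.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    # Second, third and fourth central moments (population form, divisor n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurt


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, _ = skewness_and_excess_kurtosis(right_tailed)
```

The symmetric sample has zero skewness and negative excess kurtosis (flatter than a normal distribution), while the long right tail produces positive skewness.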

Detecting outliers using the IQR method and printing the results:

>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='outliers_iqr', output='print')
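
As a rough illustration of the Interquartile Range rule, the following standard-library sketch flags values lying more than 1.5 IQRs outside the quartiles. The multiplier `k=1.5`, the quartile interpolation method, the helper name `iqr_outliers`, and the sample data are assumptions made for this example; the fences that explore_num() uses internally are not documented here.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample with one obviously extreme value.
sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]
flagged = iqr_outliers(sample)
```

Because the rule depends on quartiles rather than the mean and standard deviation, a single extreme value does not widen the fences the way it inflates a z-score denominator.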

Detecting outliers using the Z-score method with a custom threshold:

>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='outliers_zscore', threshold_z=2, output='print')
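
The effect of threshold_z can be sketched with the standard library alone. This is a minimal illustration, assuming a population standard deviation; whether explore_num() uses the population or sample form is not stated here, and the helper name `zscore_outliers` and the sample data are hypothetical.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return []  # constant data: z-scores are undefined, nothing to flag
    return [v for v in values if abs((v - mean) / std) > threshold]


# Hypothetical sample with one extreme value.
sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]
strict = zscore_outliers(sample, threshold=3.0)
loose = zscore_outliers(sample, threshold=2.0)
```

In this ten-value sample the value 30 has a z-score just under 3, so it is flagged only when the threshold is lowered to 2. Small samples often behave this way, which is one reason to adjust the threshold.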

Identifying multicollinearity and printing VIF scores:

>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='multicollinearity', output='print')
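
For intuition about VIF scores, the sketch below covers the two-predictor special case, where the variance inflation factor reduces to 1 / (1 - r^2) with r the Pearson correlation between the two predictors. It uses only the standard library and is not datasafari's implementation; the helper names and sample data are hypothetical, and the general case requires regressing each variable on all the others.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r_squared = pearson_r(xs, ys) ** 2
    if r_squared >= 1.0:
        return math.inf  # perfectly collinear predictors
    return 1.0 / (1.0 - r_squared)


# Hypothetical predictors: x2 is almost 2 * x1, x3 is unrelated to x1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 4, 6, 8, 10, 13]
x3 = [3, 1, 2, 3, 1, 2]

vif_collinear = vif_two_predictors(x1, x2)
vif_unrelated = vif_two_predictors(x1, x3)
```

A VIF near 1 indicates little shared variance. Values above roughly 5 to 10 are a common rule of thumb for problematic multicollinearity, although the cutoff that explore_num() reports against is not specified here.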

Applying all analyses and printing comprehensive results:

>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='all', output='print')

Notes

  • The function enhances interpretability by providing insights and conclusions based on the statistical tests and analyses conducted.

  • Normality tests assess whether the data distribution departs from a normal distribution, which is crucial for statistical analyses that assume normality.

  • Correlation analysis examines the strength and direction of relationships between numerical variables.

  • Multicollinearity detection is essential for regression analysis, as high multicollinearity inflates the variance of coefficient estimates and makes them unreliable.