explore_cat()

datasafari.explorer.explore_cat(df: DataFrame, categorical_variables: List[str], method: str = 'all', output: str = 'print') → str | None

Explore categorical variables in a DataFrame: list their unique values, report counts and percentages for each value, and compute the entropy of each variable as a measure of data diversity.

Parameters:

df : pd.DataFrame

The DataFrame containing the data to be explored.

categorical_variables : list

A list of strings specifying the names of the categorical columns to explore.

method : str, default: 'all'

Specifies the method of exploration to apply.

'unique_values' Lists unique values for each specified categorical variable.
'counts_percentage' Shows counts and percentages for the unique values of each variable.
'entropy' Calculates the entropy for each variable, providing a measure of data diversity. See the ‘calculate_entropy’ function for more details on entropy calculation.
'all' Applies all the above methods sequentially.

output : str, default: 'print'

Determines the output format.

'print' Prints the results to the console.
'return' Returns the results as a single formatted string.

Returns:

str or None

str If output='return', a string containing the formatted exploration results is returned.
None If output='print', results are printed to the console, and the function returns None.

Raises:

TypeError:

If df is not a pandas DataFrame.
If categorical_variables is not a list or contains non-string elements.
If method or output is not a string.

ValueError:

If df is empty.
If method is not one of the valid options.
If output is not one of the valid options.
If the categorical_variables list is empty.
If any column named in categorical_variables is not a categorical variable.
If any of the specified categorical variables are not found in the DataFrame.

Examples:

Create a sample DataFrame to use in the examples:

>>> from datasafari.explorer import explore_cat
>>> import numpy as np
>>> import pandas as pd
>>> data = {
...     'Category1': np.random.choice(['Apple', 'Banana', 'Cherry'], size=100),
...     'Category2': np.random.choice(['Yes', 'No'], size=100),
...     'Category3': np.random.choice(['Low', 'Medium', 'High'], size=100)
... }
>>> df = pd.DataFrame(data)

With the defaults (method='all', output='print'), explore_cat() applies every method and prints the results, so only the DataFrame and the categorical columns need to be provided:

>>> explore_cat(df, ['Category1', 'Category2', 'Category3'])

Display unique values for ‘Category1’ and ‘Category2’:

>>> explore_cat(df, ['Category1', 'Category2'], method='unique_values', output='print')

Explore counts and percentages for ‘Category1’ and ‘Category2’, then print the results:

>>> explore_cat(df, ['Category1', 'Category2'], method='counts_percentage', output='print')
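
For orientation, the kind of figures the 'counts_percentage' method reports can be approximated with plain pandas. This is a minimal sketch, not the library's implementation: the sample values, the `summary` name, and the column labels are assumptions chosen for illustration, and the exact layout that explore_cat() prints may differ.

```python
import pandas as pd

# Hypothetical sample column; the values are illustrative only.
s = pd.Series(
    ['Apple', 'Banana', 'Apple', 'Cherry', 'Apple', 'Banana'],
    name='Category1',
)

# Absolute frequency of each unique value.
counts = s.value_counts()

# Relative frequency, expressed as a percentage of all rows.
percentages = s.value_counts(normalize=True) * 100

# Combine both into one table, aligned on the unique values.
summary = pd.DataFrame({'count': counts, 'percent': percentages.round(2)})
print(summary)
```

Both series share the unique values as their index, so building the DataFrame from a dict aligns counts and percentages row by row.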

Calculate and return the entropy of ‘Category1’, ‘Category2’, and ‘Category3’:

>>> result = explore_cat(df, ['Category1', 'Category2', 'Category3'], method='entropy', output='return')
>>> print(result)

Apply all methods to 'Category1', 'Category2', and 'Category3', printing the results to the console:

>>> explore_cat(df, ['Category1', 'Category2', 'Category3'], method='all', output='print')

Use the 'all' method to explore 'Category1' and 'Category2', returning the results as a string:

>>> result_str = explore_cat(df, ['Category1', 'Category2'], method='all', output='return')
>>> print(result_str)

Notes:

The 'entropy' method provides a quantitative measure of the unpredictability or diversity within each specified categorical column, calculated as outlined in the documentation for 'calculate_entropy'. High entropy values indicate a more uniform distribution of categories, suggesting no single category overwhelmingly dominates; low values indicate that one or a few categories account for most of the observations.
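
To make the interpretation of entropy concrete, the sketch below computes Shannon entropy in bits for two small label lists. It assumes the standard Shannon definition with a base-2 logarithm; the helper name `shannon_entropy` is hypothetical, and the library's own 'calculate_entropy' may use a different base or report additional information, so treat this as an illustration of the concept rather than the library's method.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of category labels.

    Hypothetical helper for illustration; not part of datasafari.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over the observed categories.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Three categories, each equally frequent: entropy reaches its maximum.
uniform = ['Apple', 'Banana', 'Cherry'] * 2

# One category dominates: entropy is much lower.
skewed = ['Apple'] * 5 + ['Banana']

# Upper bound for a variable with three categories.
max_entropy = math.log2(3)

print(shannon_entropy(uniform), max_entropy)
print(shannon_entropy(skewed))
```

With k distinct categories the entropy is at most log2(k), which is reached only when every category is equally frequent. Comparing a column's entropy with that bound shows how close it is to a uniform distribution.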