explore_cat()

datasafari.explorer.explore_cat(
df: DataFrame,
categorical_variables: List[str],
method: str = 'all',
output: str = 'print',
) -> str | None

Explore categorical variables in a DataFrame: list their unique values, show counts and percentages, and compute the entropy of each variable as a measure of data diversity.

Parameters:

df : pd.DataFrame

The DataFrame containing the data to be explored.

categorical_variables : list

A list of strings specifying the names of the categorical columns to explore.

method : str, default: 'all'
Specifies the method of exploration to apply.
  • 'unique_values' Lists unique values for each specified categorical variable.

  • 'counts_percentage' Shows counts and percentages for the unique values of each variable.

  • 'entropy' Calculates the entropy for each variable, providing a measure of data diversity. See the ‘calculate_entropy’ function for more details on entropy calculation.

  • 'all' Applies all the above methods sequentially.

output : str, default: 'print'
Determines the output format.
  • 'print' Prints the results to the console.

  • 'return' Returns the results as a single formatted string.

Returns:

str or None
  • str If output='return', the formatted exploration results are returned as a single string.

  • None If output='print', the results are printed to the console and the function returns None.

Raises:

TypeError:
  • If df is not a pandas DataFrame.

  • If categorical_variables is not a list or contains non-string elements.

  • If method or output is not a string.

ValueError:
  • If df is empty.

  • If method is not one of the valid options.

  • If output is not one of the valid options.

  • If the categorical_variables list is empty.

  • If any column named in categorical_variables is not a categorical variable.

  • If any of the specified categorical variables are not found in the DataFrame.

Examples:

Create a sample DataFrame to use in the examples:

>>> from datasafari.explorer import explore_cat
>>> import numpy as np
>>> import pandas as pd
>>> data = {
...     'Category1': np.random.choice(['Apple', 'Banana', 'Cherry'], size=100),
...     'Category2': np.random.choice(['Yes', 'No'], size=100),
...     'Category3': np.random.choice(['Low', 'Medium', 'High'], size=100)
... }
>>> df = pd.DataFrame(data)

With the default arguments (method='all', output='print'), passing only the DataFrame and the categorical columns runs every exploration method and prints the results:

>>> explore_cat(df, ['Category1', 'Category2', 'Category3'])

Display unique values for ‘Category1’ and ‘Category2’:

>>> explore_cat(df, ['Category1', 'Category2'], method='unique_values', output='print')

Explore counts and percentages for ‘Category1’ and ‘Category2’, then print the results:

>>> explore_cat(df, ['Category1', 'Category2'], method='counts_percentage', output='print')

Calculate and return the entropy of ‘Category1’, ‘Category2’, and ‘Category3’:

>>> result = explore_cat(df, ['Category1', 'Category2', 'Category3'], method='entropy', output='return')
>>> print(result)

Run all exploration methods on 'Category1', 'Category2', and 'Category3' and print the results to the console:

>>> explore_cat(df, ['Category1', 'Category2', 'Category3'], method='all', output='print')

Use the 'all' method to explore 'Category1' and 'Category2', returning the results as a string:

>>> result_str = explore_cat(df, ['Category1', 'Category2'], method='all', output='return')
>>> print(result_str)
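
For orientation, the information reported by the 'unique_values' and 'counts_percentage' methods can be approximated with plain pandas. The sketch below is not the implementation of explore_cat(), and its output formatting will differ. The small hand-written DataFrame and the names `unique_values` and `summary` are illustrative assumptions chosen so the numbers are easy to verify.

```python
import pandas as pd

# Small deterministic frame (hypothetical data) so results can be checked by hand.
df = pd.DataFrame({"Category1": ["Apple", "Banana", "Apple", "Cherry"]})
col = df["Category1"]

# Roughly what 'unique_values' reports: distinct values in order of appearance.
unique_values = col.unique().tolist()

# Roughly what 'counts_percentage' reports: count and share of rows per value.
counts = col.value_counts()
percentages = col.value_counts(normalize=True) * 100
summary = pd.DataFrame({"count": counts, "percent": percentages})

print(unique_values)
print(summary)
```

Here 'Apple' appears in 2 of 4 rows, so its count is 2 and its percentage is 50.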

Notes:

The 'entropy' method provides a quantitative measure of the unpredictability or diversity within each specified categorical column, calculated as outlined in the documentation for ‘calculate_entropy’. High entropy values indicate a more uniform distribution of categories, suggesting no single category overwhelmingly dominates.
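
To make the interpretation above concrete, here is a minimal sketch of Shannon entropy for a categorical column. It assumes base-2 logarithms (entropy in bits). The exact formula, base, and edge-case handling used by calculate_entropy may differ, and `shannon_entropy` is a hypothetical helper, not part of datasafari.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return the Shannon entropy (in bits) of an iterable of category labels."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: treat an empty input as zero entropy
    entropy = 0.0
    for count in counts.values():
        p = count / total  # relative frequency of this category
        entropy -= p * math.log2(p)
    return entropy


# Two equally frequent categories: maximum entropy for a binary variable.
balanced = ["Yes", "No"] * 50
# One dominant category: entropy close to zero.
skewed = ["Yes"] * 99 + ["No"]

print(shannon_entropy(balanced))
print(shannon_entropy(skewed))
```

The balanced column yields 1 bit, the maximum for two categories, while the skewed column yields a value near zero. The helper accepts any iterable of labels, so a pandas column such as df['Category2'] can be passed directly.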