explore_cat()
datasafari.explorer.explore_cat(df: DataFrame, categorical_variables: List[str], method: str = 'all', output: str = 'print')
Explore categorical variables within a DataFrame and gain insights on unique values, counts and percentages, and the entropy of variables to quantify data diversity.
Parameters:
- df : pd.DataFrame
  The DataFrame containing the data to be explored.
- categorical_variables : list
  A list of strings specifying the names of the categorical columns to explore.
- method : str, default: 'all'
  Specifies the method of exploration to apply.
  - 'unique_values': Lists unique values for each specified categorical variable.
  - 'counts_percentage': Shows counts and percentages for the unique values of each variable.
  - 'entropy': Calculates the entropy for each variable, providing a measure of data diversity. See the calculate_entropy function for more details on entropy calculation.
  - 'all': Applies all the above methods sequentially.
- output : str, default: 'print'
  Determines the output format.
  - 'print': Prints the results to the console.
  - 'return': Returns the results as a single formatted string.
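As a rough illustration of the quantities that the 'unique_values' and 'counts_percentage' methods report, the same figures can be computed directly with pandas. This is a sketch built on standard pandas calls, not datasafari's implementation; the small DataFrame and the variable names are made up for the example.

```python
import pandas as pd

# Hypothetical four-row column, chosen so the percentages are easy to check by hand.
df = pd.DataFrame({'Category2': ['Yes', 'No', 'Yes', 'Yes']})
col = df['Category2']

# Unique values, in order of first appearance.
unique_values = col.unique().tolist()

# Absolute counts per category, and the same counts as percentages of all rows.
counts = col.value_counts()
percentages = col.value_counts(normalize=True) * 100

# Side-by-side summary, indexed by category.
summary = pd.DataFrame({'count': counts, 'percentage': percentages})
print(unique_values)
print(summary)
```

The formatting of explore_cat()'s own report may differ; only the underlying numbers are meant to correspond.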
Returns:
- str or None
  - str: If output='return', a string containing the formatted exploration results is returned.
  - None: If output='print', results are printed to the console, and the function returns None.
Raises:
- TypeError:
  - If df is not a pandas DataFrame.
  - If categorical_variables is not a list or contains non-string elements.
  - If method or output is not a string.
- ValueError:
  - If df is empty.
  - If method is not one of the valid options.
  - If output is not one of the valid options.
  - If the categorical_variables list is empty.
  - If any variable listed in categorical_variables is not a categorical variable.
  - If any of the specified categorical variables are not found in the DataFrame.
Examples:
Create a sample DataFrame to use in the examples:
>>> import numpy as np
>>> import pandas as pd
>>> from datasafari.explorer import explore_cat
>>> data = {
...     'Category1': np.random.choice(['Apple', 'Banana', 'Cherry'], size=100),
...     'Category2': np.random.choice(['Yes', 'No'], size=100),
...     'Category3': np.random.choice(['Low', 'Medium', 'High'], size=100)
... }
>>> df = pd.DataFrame(data)
Given only a DataFrame and the categorical columns to explore, explore_cat() falls back on its defaults (method='all', output='print'), so it applies every method and prints the results:
>>> explore_cat(df, ['Category1', 'Category2', 'Category3'])
Display unique values for ‘Category1’ and ‘Category2’:
>>> explore_cat(df, ['Category1', 'Category2'], method='unique_values', output='print')
Explore counts and percentages for ‘Category1’ and ‘Category2’, then print the results:
>>> explore_cat(df, ['Category1', 'Category2'], method='counts_percentage', output='print')
Calculate and return the entropy of ‘Category1’, ‘Category2’, and ‘Category3’:
>>> result = explore_cat(df, ['Category1', 'Category2', 'Category3'], method='entropy', output='return')
>>> print(result)
Apply all methods to ‘Category1’, ‘Category2’, and ‘Category3’ and print the results to the console:
>>> explore_cat(df, ['Category1', 'Category2', 'Category3'], method='all', output='print')
Use the ‘all’ method to explore ‘Category1’ and ‘Category2’, returning the results as a string:
>>> result_str = explore_cat(df, ['Category1', 'Category2'], method='all', output='return')
>>> print(result_str)
Notes:
The 'entropy' method provides a quantitative measure of the unpredictability or diversity within each specified categorical column, calculated as outlined in the documentation for calculate_entropy. High entropy values indicate a more uniform distribution of categories, suggesting no single category overwhelmingly dominates.
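To make the link between entropy and uniformity concrete, the following minimal sketch computes Shannon entropy for a balanced and a skewed two-category column. It assumes entropy measured in bits (log base 2); the helper shannon_entropy is hypothetical and written only for this illustration, so the base and output of datasafari's calculate_entropy may differ.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the observed category frequencies (illustrative helper)."""
    counts = Counter(values)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the categories that actually occur.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


balanced = ['Yes', 'No'] * 50          # 50 'Yes', 50 'No'
skewed = ['Yes'] * 95 + ['No'] * 5     # one category dominates

h_balanced = shannon_entropy(balanced)
h_skewed = shannon_entropy(skewed)

# With k equally likely categories the entropy reaches its maximum, log2(k).
h_max = math.log2(len(set(balanced)))

print(h_balanced, h_skewed, h_max)
```

The balanced column reaches the two-category maximum of 1 bit, while the skewed column scores well below it, which matches the reading of high entropy as "no single category dominates".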