evaluate_dtype()

datasafari.evaluator.evaluate_dtype(df: DataFrame, col_names: list, max_unique_values_ratio: float = 0.05, min_unique_values: int = 10, string_length_threshold: int = 50, small_dataset_threshold: int = 20, output: str = 'dict') → dict | list

Evaluate and automatically categorize the data types of DataFrame columns, using configurable thresholds to resolve ambiguous cases.

This function examines columns within a DataFrame to determine whether each is numerical, categorical, text, or datetime. It goes beyond simple data type checks by considering the distribution of unique values within numerical data. As a result, a numerical column with only a few unique values can be treated as categorical, based on the specified thresholds.
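The unique-value idea can be illustrated with a short sketch. This is not datasafari's implementation: `looks_categorical` is a hypothetical helper, it works on plain Python lists rather than DataFrame columns, and the way it combines the two thresholds (both must hold) is an assumption, since the documentation does not state the exact rule. The default values are taken from the signature above.

```python
def looks_categorical(values, max_unique_values_ratio=0.05, min_unique_values=10):
    """Hypothetical helper: flag a numeric column that behaves like a category."""
    n_total = len(values)
    if n_total == 0:
        raise ValueError("values must not be empty")

    n_unique = len(set(values))
    unique_ratio = n_unique / n_total

    # Assumption: require BOTH a low unique ratio and a low unique count.
    # The library may combine these thresholds differently.
    return unique_ratio <= max_unique_values_ratio and n_unique < min_unique_values


# 200 rows with 5 distinct values: ratio 0.025, count 5.
ratings = [1, 2, 3, 4, 5] * 40
# 200 rows, all distinct: ratio 1.0, count 200.
incomes = [30000 + 137 * i for i in range(200)]

print(looks_categorical(ratings))
print(looks_categorical(incomes))
```

Under these assumptions the rating-like column is flagged as categorical and the income-like column is not.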

Parameters:

df : pd.DataFrame

The DataFrame containing the data to be evaluated.

col_names : list

A list of column names whose data types are to be evaluated.

max_unique_values_ratio : float, optional, default: 0.05

The maximum ratio of unique values to total observations for a column to be considered categorical.

min_unique_values : int, optional, default: 10

The minimum number of unique values for a column to be considered continuous. Columns with fewer unique values are treated as categorical.

string_length_threshold : int, optional, default: 50

The average string length threshold above which a column is classified as text data.

small_dataset_threshold : int, optional, default: 20

The row count below which a dataset is treated as small. In such datasets a column is more likely to be classified as categorical.

output : str, optional, default: 'dict'

Specifies the format of the output.

'dict': Returns a dictionary mapping column names to their determined data types ('numerical', 'categorical', 'text', 'datetime').
'list_n': Returns a list of booleans indicating whether each column in col_names is numerical (True) or not (False).
'list_c': Returns a list of booleans indicating whether each column in col_names is categorical (True) or not (False).
'list_d': Returns a list of booleans indicating whether each column in col_names is datetime (True) or not (False).
'list_t': Returns a list of booleans indicating whether each column in col_names is text (True) or not (False).
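The relationship between the dictionary and list formats can be sketched as follows. The dictionary below is a hand-written stand-in shaped like the documented 'dict' output, not a real result from evaluate_dtype, and `to_bool_list` is a hypothetical helper introduced only for illustration.

```python
# Hypothetical result, shaped like the documented output='dict' return value.
dtype_by_column = {
    "price": "numerical",
    "city": "categorical",
    "review": "text",
    "signup_date": "datetime",
}
col_names = ["price", "city", "review", "signup_date"]


def to_bool_list(dtypes, names, target):
    """Return one boolean per name, in order: True where the dtype equals target."""
    return [dtypes[name] == target for name in names]


# Each list format corresponds to one target data type.
list_n = to_bool_list(dtype_by_column, col_names, "numerical")
list_c = to_bool_list(dtype_by_column, col_names, "categorical")
list_d = to_bool_list(dtype_by_column, col_names, "datetime")
list_t = to_bool_list(dtype_by_column, col_names, "text")

print(list_n)
print(list_t)
```

Because each list preserves the order of col_names, it can be used directly as a boolean mask over that list of names.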

Returns:

dict or list

Depending on the 'output' parameter, this function returns a dictionary or a list.

dict: If output='dict', a dictionary mapping column names to their determined data types ('numerical', 'categorical', 'text', or 'datetime').
list: If output is 'list_n', 'list_c', 'list_d', or 'list_t', a list of booleans indicating the nature of each column in col_names (in the same order), according to the specified 'output' parameter.

Raises:

TypeError:

If df is not a pandas DataFrame.
If col_names is not a list or if elements of col_names are not all strings.
If max_unique_values_ratio is not a float or an integer.
If min_unique_values, string_length_threshold, or small_dataset_threshold is not an integer.
If output is not a string.

ValueError:

If df is empty.
If max_unique_values_ratio is outside the range [0, 1].
If min_unique_values is less than 1, as at least one unique value is needed to categorize a column.
If string_length_threshold is less than or equal to 0, indicating an invalid threshold for text data classification.
If col_names is empty or any of the column names in col_names are not present in the DataFrame.
If the output string does not correspond to one of the valid options.
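The checks listed above can be pictured with a small sketch. `check_arguments` and `error_name` are hypothetical helpers, not part of datasafari. The sketch omits the DataFrame checks so that it runs without pandas, assumes type checks run before value checks, and follows the ValueError entry above for unrecognized output strings.

```python
VALID_OUTPUTS = ("dict", "list_n", "list_c", "list_d", "list_t")


def check_arguments(col_names, max_unique_values_ratio=0.05, min_unique_values=10,
                    string_length_threshold=50, small_dataset_threshold=20,
                    output="dict"):
    """Hypothetical validator mirroring the documented checks (DataFrame checks omitted)."""
    # Assumption: type checks run first, so a wrongly typed argument raises TypeError.
    if not isinstance(col_names, list) or not all(isinstance(c, str) for c in col_names):
        raise TypeError("col_names must be a list of strings")
    if not isinstance(max_unique_values_ratio, (int, float)):
        raise TypeError("max_unique_values_ratio must be a float or an integer")
    for name, value in [("min_unique_values", min_unique_values),
                        ("string_length_threshold", string_length_threshold),
                        ("small_dataset_threshold", small_dataset_threshold)]:
        if not isinstance(value, int):
            raise TypeError(f"{name} must be an integer")
    if not isinstance(output, str):
        raise TypeError("output must be a string")

    # Value checks apply once every argument has an acceptable type.
    if not col_names:
        raise ValueError("col_names must not be empty")
    if not 0 <= max_unique_values_ratio <= 1:
        raise ValueError("max_unique_values_ratio must be within [0, 1]")
    if min_unique_values < 1:
        raise ValueError("min_unique_values must be at least 1")
    if string_length_threshold <= 0:
        raise ValueError("string_length_threshold must be greater than 0")
    if output not in VALID_OUTPUTS:
        raise ValueError(f"output must be one of {VALID_OUTPUTS}")


def error_name(**kwargs):
    """Return the name of the exception raised by check_arguments, or None."""
    try:
        check_arguments(**kwargs)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


print(error_name(col_names=["Age"]))                               # valid call
print(error_name(col_names="Age"))                                 # wrong type
print(error_name(col_names=["Age"], max_unique_values_ratio=1.5))  # out of range
print(error_name(col_names=["Age"], output="list_x"))              # unknown format
```

In calling code, the same distinction suggests catching TypeError for wrongly typed arguments separately from ValueError for well-typed but out-of-range ones.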

Examples:

Create an example DataFrame with mixed data types:

>>> from datasafari.evaluator import evaluate_dtype
>>> import pandas as pd
>>> import numpy as np
>>> data = {
...    'Age': np.random.randint(18, 35, 100),
...    'Income': np.random.normal(50000, 15000, 100),
...    'Department': np.random.choice(['HR', 'Tech', 'Admin'], 100),
...    'Entry Date': pd.date_range(start='2021-01-01', periods=100, freq='M')
... }
>>> df = pd.DataFrame(data)

Evaluating data types with a dictionary output format:

>>> data_type_dict = evaluate_dtype(df, ['Age', 'Income', 'Department', 'Entry Date'], output='dict')
>>> print(data_type_dict)

Evaluating data types with a list output format indicating numerical data:

>>> numerical_bool_list = evaluate_dtype(df, ['Age', 'Income', 'Department', 'Entry Date'], output='list_n')
>>> print(numerical_bool_list)

Evaluating data types with a list output format indicating categorical data:

>>> categorical_bool_list = evaluate_dtype(df, ['Age', 'Income', 'Department', 'Entry Date'], output='list_c')
>>> print(categorical_bool_list)

Evaluating data types with a list output format indicating datetime data:

>>> datetime_bool_list = evaluate_dtype(df, ['Age', 'Income', 'Department', 'Entry Date'], output='list_d')
>>> print(datetime_bool_list)