evaluate_normality()

datasafari.evaluator.evaluate_normality(
df: DataFrame,
target_variable: str,
grouping_variable: str,
method: str = 'consensus',
pipeline: bool = False,
) -> dict | bool

Evaluate normality of numerical data within groups defined by a categorical variable, employing multiple statistical tests, dynamically chosen based on data suitability.

This function offers a comprehensive examination of the distribution’s normality by utilizing tests like Shapiro-Wilk, Anderson-Darling, D’Agostino and Pearson’s test, and Lilliefors. Each test provides insights into different aspects of normality, and the consensus method integrates these perspectives to make a more informed decision.

Parameters:

df : pd.DataFrame

The DataFrame containing the data to be tested.

target_variable : str

The name of the numeric variable to test for normality.

grouping_variable : str

The name of the categorical variable used to create subsets of data for normality testing.

method : str, optional, default: 'consensus'
The method to use for testing normality.
  • 'shapiro': Shapiro-Wilk test

  • 'anderson': Anderson-Darling test

  • 'normaltest': D’Agostino and Pearson’s test

  • 'lilliefors': Lilliefors test

  • 'consensus': A combination of the above tests.

pipeline : bool, optional, default: False
  • True: Simplifies the output to a boolean indicating the consensus on normality. Useful for integration into automated analysis pipelines.

  • False: The output is a dictionary containing the results of the respective test(s).

Returns:

dict or bool
  • dict: If pipeline=False, a dictionary with test names as keys and test results (statistics, p-values, and normality conclusions) as values.

  • bool: If pipeline=True, a boolean indicating the consensus on normality across all tests or, if a single test method was selected instead of 'consensus', the result of that test.

Raises:

TypeError:
  • If df is not a pandas DataFrame.

  • If target_variable or grouping_variable is not a string.

  • If method is not a string.

  • If pipeline is not a boolean.

ValueError:
  • If df is empty.

  • If the target_variable or grouping_variable does not exist in the DataFrame.

  • If the method specified is not supported.

  • If the target_variable is not numerical, or if the grouping_variable is not categorical, as determined by evaluating their data types with evaluate_dtype().

Examples:

Example 1: Using the consensus method to evaluate normality in a DataFrame:

>>> from datasafari.evaluator import evaluate_normality
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
...     'Group': np.random.choice(['A', 'B', 'C'], size=100),
...     'Data': np.random.normal(0, 1, size=100)
... })
>>> normality_result = evaluate_normality(df, 'Data', 'Group')

Example 2: Using the function in a comprehensive evaluation pipeline:

>>> pipeline_result = evaluate_normality(df, 'Data', 'Group', pipeline=True)
>>> if pipeline_result:
...     pass  # your pipeline in the case normality is validated
... else:
...     pass  # your pipeline in the case normality is not validated

Notes:

Consensus Method: The consensus method integrates results from multiple statistical tests to provide a comprehensive assessment of the normality of data distributions.

Here is how the consensus method operates:
  1. Test Execution:

    • Shapiro-Wilk Test: Assesses normality based on the correlation between data and corresponding normal scores, ideal for small sample sizes.

    • Anderson-Darling Test: Focuses more on the tails of the distribution, suitable for any sample size.

    • D’Agostino-Pearson Test: Combines skewness and kurtosis to assess normality, best for larger datasets.

    • Lilliefors Test: An adaptation of the Kolmogorov-Smirnov test that does not require known mean and variance, useful for small to medium samples.

  2. Outcome Evaluation: Each test provides a conclusion on whether the data follow a normal distribution.

  3. Majority Rule: The final consensus on normality is based on the majority of test outcomes. If more tests conclude ‘normal’, the consensus is that the data are normally distributed. Conversely, if more tests conclude ‘non-normal’, the consensus is that the data are not normally distributed.

  4. Tie-Breaker: In the event of a tie (an equal number of ‘normal’ and ‘non-normal’ outcomes), the results of the Shapiro-Wilk and Anderson-Darling tests are given precedence. These tests are chosen because of their robustness and widespread acceptance in statistical testing. If both agree, their conclusion is adopted; if they disagree, further analysis may be required to determine normality.
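The majority rule and tie-breaker described in steps 3 and 4 can be sketched in plain Python. This is an illustration only, not datasafari's actual implementation: the function name `consensus_normality`, the dictionary keys, and the choice to return `None` when the tie-breaker tests disagree are all assumptions made for this example.

```python
from typing import Mapping, Optional


def consensus_normality(outcomes: Mapping[str, bool]) -> Optional[bool]:
    """Combine per-test conclusions (True = normal) into one decision.

    Hypothetical helper: returns True or False for a clear decision and
    None when the tie-breaker tests disagree (an assumption of this sketch).
    """
    normal_votes = sum(outcomes.values())
    non_normal_votes = len(outcomes) - normal_votes

    # Step 3: majority rule.
    if normal_votes != non_normal_votes:
        return normal_votes > non_normal_votes

    # Step 4: on a tie, defer to Shapiro-Wilk and Anderson-Darling
    # if both are present and agree with each other.
    shapiro = outcomes.get("shapiro")
    anderson = outcomes.get("anderson")
    if shapiro is not None and shapiro == anderson:
        return shapiro
    return None  # tie-breakers disagree: further analysis needed


# A 2-2 tie in which Shapiro-Wilk and Anderson-Darling both say "normal".
tie_resolved = consensus_normality(
    {"shapiro": True, "anderson": True, "normaltest": False, "lilliefors": False}
)
```

Returning `None` rather than a boolean keeps the undecided case visible to the caller; how the library itself reports that case is not stated here.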

This structured approach ensures a thorough and balanced evaluation of normality, accommodating different sample sizes and distribution characteristics, which enhances the reliability of the statistical analysis.
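For readers who want to see what per-group testing looks like outside the library, the following is a minimal sketch that calls SciPy directly. It is not how `evaluate_normality()` is implemented; the helper name `normality_by_group`, the fixed 5% significance level, and the result layout are assumptions for illustration. The Lilliefors test is omitted to keep the dependencies to SciPy; it is available as `statsmodels.stats.diagnostic.lilliefors`.

```python
import numpy as np
import pandas as pd
from scipy import stats

ALPHA = 0.05  # assumed significance level for this sketch


def normality_by_group(df, target, group):
    """Run three SciPy normality tests on `target` within each group.

    Hypothetical helper: returns {group: {test_name: bool}}, where True
    means the test did not reject normality at the 5% level.
    """
    results = {}
    for name, values in df.groupby(group)[target]:
        x = values.dropna().to_numpy()

        _, shapiro_p = stats.shapiro(x)
        _, dagostino_p = stats.normaltest(x)

        # Anderson-Darling reports critical values rather than a p-value;
        # for dist="norm", index 2 corresponds to the 5% level.
        ad = stats.anderson(x, dist="norm")
        anderson_normal = ad.statistic < ad.critical_values[2]

        results[name] = {
            "shapiro": bool(shapiro_p > ALPHA),
            "anderson": bool(anderson_normal),
            "normaltest": bool(dagostino_p > ALPHA),
        }
    return results


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Group": rng.choice(["A", "B", "C"], size=150),
    "Data": rng.normal(0, 1, size=150),
})
results = normality_by_group(df, "Data", "Group")
```

Each inner dictionary can then be combined into a single decision per group, for example with a majority rule like the one described in the Notes.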