Quick Start
Getting started with DataSafari is straightforward.
Install it using pip in your terminal:
pip install datasafari
Or install it using Poetry:
poetry add datasafari
Import DataSafari in your Python script:
import datasafari as ds
For detailed installation options, including installing from the source, see the Installation Guide.
Hypothesis Testing? One line.
from datasafari.predictor import predict_hypothesis
import pandas as pd
import numpy as np
# Create a sample DataFrame
df_hypothesis = pd.DataFrame({
    'Group': np.random.choice(['Control', 'Treatment'], size=100),
    'Score': np.random.normal(0, 1, 100)
})
# Perform hypothesis testing
results = predict_hypothesis(df_hypothesis, 'Group', 'Score')
How DataSafari Streamlines Hypothesis Testing:
- Automatic Test Selection: Depending on the data types, predict_hypothesis() automatically selects the appropriate test. It uses Chi-square, Fisher's exact test, or other exact tests for categorical pairs, and t-tests, ANOVA, and others for categorical-numerical combinations, adapting to group counts, sample size, and data distribution (a hand-rolled sketch of this kind of logic follows this list).
- Assumption Verification: Essential assumptions for the chosen tests are checked automatically.
  - Normality: verified with tests such as Shapiro-Wilk or Anderson-Darling, a prerequisite for parametric tests.
  - Variance Homogeneity: confirmed with Levene's or Bartlett's test, informing the choice between ANOVA variants.
- Comprehensive Output:
  - Justifications: comprehensive reasoning behind every test choice.
  - Test Statistics: the key quantitative results of the hypothesis test.
  - P-values: indicators of the statistical significance of the findings.
  - Conclusions: clear textual interpretations of whether the results support or reject the hypothesis.
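To make the test-selection and assumption-checking steps concrete, here is a minimal hand-rolled sketch of that logic for the two-group example above, written with scipy.stats. It illustrates the general approach only, not DataSafari's internal implementation, and the 0.05 threshold is an assumed convention:
# Illustrative sketch only -- not DataSafari's internals.
# Assumption checks and test selection for the two-group example above.
from scipy import stats
control = df_hypothesis.loc[df_hypothesis['Group'] == 'Control', 'Score']
treatment = df_hypothesis.loc[df_hypothesis['Group'] == 'Treatment', 'Score']
# Normality per group (Shapiro-Wilk) and homogeneity of variances (Levene)
normal = all(stats.shapiro(group)[1] > 0.05 for group in (control, treatment))
equal_var = stats.levene(control, treatment)[1] > 0.05
# Pick a test based on the assumption checks
if normal:
    stat, p = stats.ttest_ind(control, treatment, equal_var=equal_var)
else:
    stat, p = stats.mannwhitneyu(control, treatment)
print(f"statistic={stat:.3f}, p-value={p:.3f}")
With predict_hypothesis(), this selection, checking, and reporting happens in the single call shown above.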
Machine Learning? You guessed it.
from datasafari.predictor import predict_ml
import pandas as pd
import numpy as np
# Create another sample DataFrame for ML
df_ml = pd.DataFrame({
    'Age': np.random.randint(20, 60, size=100),
    'Salary': np.random.normal(50000, 15000, size=100),
    'Experience': np.random.randint(1, 20, size=100)
})
x_cols = ['Age', 'Experience'] # Feature columns
y_col = 'Salary' # Target column
# Find the best models for your data
best_models = predict_ml(df_ml, x_cols, y_col)
How DataSafari Simplifies Machine Learning Model Selection:
- Tailored Data Preprocessing: The function automatically preprocesses various types of data (numerical, categorical, text, datetime), preparing each optimally for machine learning (a preprocessing sketch follows this list).
  - Numerical data might be scaled or normalized.
  - Categorical data can be encoded.
  - Text data might be vectorized using techniques suitable for the analysis.
- Intelligent Model Evaluation: The function evaluates a variety of models using a composite score that synthesizes performance across multiple metrics rather than relying on a single one (see the composite-score sketch below).
  - Composite Score Calculation: Scores for each metric are weighted according to priorities specified by the user, with lower weights assigned to non-priority metrics (e.g. RMSE prioritized over MAE). The composite score serves as a holistic measure of model performance, ensuring that the recommended models are not just good in one aspect but robust across multiple criteria.
- Automated Hyperparameter Tuning: Once the top models are identified based on the composite score, the pipeline employs techniques like grid search, random search, or Bayesian optimization to fine-tune them (see the tuning sketch below).
  - Output of Tuned Models: The best configurations are returned along with their performance metrics, allowing users to decide which models to deploy based on robust, empirically derived results.
- Customization Options & Sensible Defaults: Users can define custom hyperparameter grids, select specific tuning algorithms, prioritize models and metrics, and tailor data preprocessing.
  - Accessibility: Every part of the process is in the user's hands, but sensible defaults are provided for maximum simplicity of use, which is the approach of datasafari in general.
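As a rough illustration of what tailored per-type preprocessing typically looks like, here is a minimal scikit-learn sketch. It is an assumption about the general technique, not DataSafari's internal pipeline, and the 'City' and 'Review' columns are hypothetical additions to the example data:
# Illustrative preprocessing sketch (not DataSafari's internal pipeline).
# 'City' (categorical) and 'Review' (text) are hypothetical example columns.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['Age', 'Experience']),           # scale numerical columns
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['City']),  # encode categorical columns
    ('txt', TfidfVectorizer(), 'Review'),                       # vectorize a text column
])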
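The exact weighting scheme is DataSafari's own; the sketch below only shows the general idea of a priority-weighted composite score, with assumed metric values and weights: each metric is first converted to a higher-is-better score, then averaged with the weights.
# Illustrative composite scoring across candidate models (metric values and
# weights are assumed for the example; not DataSafari's exact formula).
example_results = {
    'Ridge':        {'RMSE': 10450.0, 'MAE': 8120.0, 'R2': 0.62},
    'RandomForest': {'RMSE':  9980.0, 'MAE': 7890.0, 'R2': 0.66},
}
weights = {'RMSE': 3, 'MAE': 1, 'R2': 2}  # assumed priorities: RMSE weighted over MAE
def to_score(metric, value, all_values):
    lo, hi = min(all_values), max(all_values)
    scaled = (value - lo) / (hi - lo) if hi > lo else 1.0
    return 1.0 - scaled if metric in ('RMSE', 'MAE') else scaled  # flip error metrics
composite = {}
for model, metrics in example_results.items():
    scores = {m: to_score(m, v, [r[m] for r in example_results.values()])
              for m, v in metrics.items()}
    composite[model] = sum(weights[m] * scores[m] for m in scores) / sum(weights.values())
print(composite)  # higher composite -> more robust across the weighted metrics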
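For the tuning step, a minimal grid-search sketch with scikit-learn looks like this. It illustrates the technique named above rather than the call DataSafari makes internally, and the parameter grid is an assumption:
# Illustrative hyperparameter tuning for one candidate model
# (grid values are assumed; not DataSafari's internal call).
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(df_ml[x_cols], df_ml[y_col])
print(search.best_params_, search.best_score_)  # tuned configuration and its CV score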