Freamon: Feature-Rich EDA, Analytics, and Modeling Toolkit

Freamon is a comprehensive Python toolkit for exploratory data analysis, feature engineering, and model development with a focus on practical data science workflows.

Quick Start | Documentation | Installation | Examples

Features

- Exploratory Data Analysis: Automatic EDA with comprehensive reporting in HTML, Markdown, Excel, PowerPoint, and interactive Jupyter notebook displays
- Advanced Multivariate Analysis: PCA visualization, correlation networks, and target-oriented analysis
- Feature Engineering: Advanced feature engineering for numeric, categorical, and text data
- Feature Selection: Statistical feature selection including Chi-square, ANOVA F-test, and effect size analysis
- Deduplication: High-performance deduplication with Polars optimization (2-5x faster, 60-70% less memory), LSH, supervised ML, and active learning
- Topic Modeling: Optimized text analysis with NMF and LDA, supporting large datasets up to 100K documents
- Automated Modeling: Intelligent end-to-end modeling workflow for text, tabular, and time series data
- Modeling: Custom model implementations with feature importance and model interpretation
- Pipeline: Scikit-learn compatible pipeline with additional features
- Drift Analysis: Tools for detecting and analyzing data drift
- Word Embeddings: Integration with various word embedding techniques
- Visualization: Publication-quality visualizations with proper handling of all special characters
- Performance Optimization: Multiprocessing support and intelligent sampling for large dataset analysis

Installation

Basic Installation

For basic functionality (EDA, visualization, core deduplication):

pip install freamon

Installation with All Features

For full functionality including advanced modeling, text processing, and performance optimizations:

pip install "freamon[all]"

Feature-Specific Installation

For specific feature sets:

# For high-performance with Polars acceleration
pip install "freamon[performance]"

# For text analysis and topic modeling
pip install "freamon[topic_modeling]"

# For word embeddings support
pip install "freamon[word_embeddings]"

# For extended features (modeling, Polars, LightGBM, SHAP, etc.)
pip install "freamon[extended]"

# For Markdown report generation
pip install "freamon[markdown_reports]"

Dependencies by Feature

Here's what each optional dependency provides:

Core (always installed):
- numpy, pandas, scikit-learn, matplotlib, seaborn, networkx

Performance (freamon[performance]):
- pyarrow - For faster data processing

Extended (freamon[extended]):
- polars - High-performance DataFrame library (2-5x faster than pandas)
- lightgbm - Gradient boosting framework
- optuna - Hyperparameter optimization
- shap - Model explanation
- spacy - NLP processing
- statsmodels - Statistical modeling
- dask - Parallel computing

Topic Modeling (freamon[topic_modeling]):
- gensim - Topic modeling
- pyldavis - Topic visualization
- wordcloud - Word cloud generation

Word Embeddings (freamon[word_embeddings]):
- gensim - Word vectors
- nltk - Natural language toolkit
- spacy - Linguistic features

Markdown Reports (freamon[markdown_reports]):
- markdown - Report generation
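
Before relying on an optional feature, it can help to check which of the packages listed above are importable in the current environment. The following is a minimal sketch using only the standard library; it is not part of Freamon, and the import names (for example `pyLDAvis`) are assumptions that may differ from the pip package names.

```python
from importlib.util import find_spec

# Import names grouped by extra, based on the dependency list above.
# Assumption: these are the importable module names, which can differ
# from the names used with pip (e.g. "pyLDAvis" for pyldavis).
EXTRAS = {
    "performance": ["pyarrow"],
    "extended": ["polars", "lightgbm", "optuna", "shap", "spacy", "statsmodels", "dask"],
    "topic_modeling": ["gensim", "pyLDAvis", "wordcloud"],
    "word_embeddings": ["gensim", "nltk", "spacy"],
    "markdown_reports": ["markdown"],
}


def missing_modules(modules):
    """Return the module names that cannot be found on the import path."""
    return [name for name in modules if find_spec(name) is None]


for extra, modules in EXTRAS.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else f"missing: {', '.join(missing)}"
    print(f"freamon[{extra}]: {status}")
```

If an extra reports missing modules, installing it with the matching `pip install "freamon[...]"` command from the section above should pull them in.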

Quick Start

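import pandas as pd
from freamon.eda import EDAAnalyzer

# Load your data (the file name and the 'target' column are placeholders)
df = pd.read_csv('your_data.csv')

# Create an analyzer instance
analyzer = EDAAnalyzer(df, target_column='target')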

# Run the analysis
analyzer.run_full_analysis()

# Generate a report
analyzer.generate_report('eda_report.html')

# Or a markdown report for version control
analyzer.generate_report('eda_report.md', format='markdown')

Key Components

Step-by-Step Workflow Example

Below is a complete workflow covering data type detection, EDA, automated modeling, and reporting:

import pandas as pd
import numpy as np
from freamon.eda import EDAAnalyzer
from freamon.utils.datatype_detector import detect_datatypes
from freamon import auto_model

# Required for PowerPoint/Excel reports
# pip install "freamon[extended]"
from freamon.eda.export import export_to_powerpoint, export_to_excel

# 1. Load sample data
df = pd.read_csv('customer_data.csv')
print(f"Dataset shape: {df.shape}")

# 2. Run data type detection
datatype_results = detect_datatypes(df)
print("\nDetected data types:")
print(f"Text columns: {datatype_results['text_columns']}")
print(f"Categorical columns: {datatype_results['categorical_columns']}")
print(f"Numeric columns: {datatype_results['numeric_columns']}")
print(f"Date columns: {datatype_results['date_columns']}")

# 3. Generate data type detection report
from freamon.utils.datatype_fixes import save_detection_report
save_detection_report(
    datatype_results,
    'datatype_detection_report.html',
    title='Customer Data Type Detection'
)
print("\nData type detection report saved to 'datatype_detection_report.html'")

# 4. Run EDA analysis
analyzer = EDAAnalyzer(
    df,
    target_column='churn',  # For supervised analysis
    text_columns=datatype_results['text_columns'],
    categorical_columns=datatype_results['categorical_columns'],
    numeric_columns=datatype_results['numeric_columns'],
    datetime_columns=datatype_results['date_columns']
)
analyzer.run_full_analysis()

# 5. Generate EDA reports in different formats
analyzer.generate_report('eda_report.html')  # HTML report
analyzer.generate_report('eda_report.md', format='markdown')  # Markdown report
print("\nEDA reports generated in HTML and Markdown formats")

# 6. Export EDA results to PowerPoint for presentations
export_to_powerpoint(
    analyzer.get_report_data(),
    'eda_presentation.pptx',
    report_type='eda'
)
print("\nEDA results exported to PowerPoint")

# 7. Run automated modeling with data type detection
# Note: Install required dependencies for advanced modeling:
# pip install "freamon[extended,topic_modeling]"
results = auto_model(
    df=df,
    target_column='churn',
    problem_type='classification',
    # Use our detected data types
    text_columns=datatype_results['text_columns'],
    categorical_columns=datatype_results['categorical_columns'],
    date_column=datatype_results['date_columns'][0] if datatype_results['date_columns'] else None
)

# 8. Examine model results
print("\nModel Performance:")
for metric, value in results['metrics'].items():
    if 'mean' in metric:
        print(f"{metric}: {value:.4f}")

# 9. Plot model visualizations
fig1 = results['autoflow'].plot_metrics()
fig1.savefig('cv_metrics.png')

fig2 = results['autoflow'].plot_importance(top_n=15)
fig2.savefig('feature_importance.png')

# 10. Export model results to Excel
model_data = {
    'model_type': results['autoflow'].model_type,
    'metrics': results['metrics'],
    'feature_importance': results['feature_importance'],
    'training_date': pd.Timestamp.now().strftime('%Y-%m-%d')
}
export_to_excel(model_data, 'model_performance.xlsx', report_type='model')

# 11. Export model results to PowerPoint
export_to_powerpoint(model_data, 'model_presentation.pptx', report_type='model')
print("\nModel results exported to Excel and PowerPoint")

# 12. Make predictions on new data
new_data = pd.read_csv('new_customers.csv')
predictions = results['autoflow'].predict(new_data)
new_data['predicted_churn'] = predictions
new_data.to_csv('predictions.csv', index=False)
print("\nPredictions saved to 'predictions.csv'")
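
Step 2 of the workflow sorts columns into text, categorical, numeric, and date groups. The rules `detect_datatypes` applies are not described here, so the following is only a hypothetical sketch of the kind of heuristic such a detector could use. The function name `infer_column_type`, the single date format, and the word-count threshold are all assumptions made for illustration.

```python
from datetime import datetime


def _is_number(value):
    """True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def _is_date(value, fmt="%Y-%m-%d"):
    """True if the value matches the (assumed) ISO date format."""
    try:
        datetime.strptime(str(value), fmt)
        return True
    except ValueError:
        return False


def infer_column_type(values, text_min_avg_words=4):
    """Classify a column as numeric, date, text, or categorical."""
    non_null = [v for v in values if v is not None and v != ""]
    if not non_null:
        return "empty"
    if all(_is_number(v) for v in non_null):
        return "numeric"
    if all(_is_date(v) for v in non_null):
        return "date"
    # Long free-form strings are treated as text, short labels as categorical.
    avg_words = sum(len(str(v).split()) for v in non_null) / len(non_null)
    return "text" if avg_words >= text_min_avg_words else "categorical"


columns = {
    "age": [34, 51, "29"],
    "signup_date": ["2024-01-31", "2023-12-01"],
    "plan": ["basic", "premium", "basic"],
    "feedback": ["the app keeps crashing on login", "great service and fast support"],
}
detected = {name: infer_column_type(vals) for name, vals in columns.items()}
print(detected)
```

A real detector would need to handle many more cases (mixed date formats, numeric codes that are really categories, and so on), which is why the workflow passes the detected groups explicitly to `EDAAnalyzer` and `auto_model`, where they can be reviewed or overridden.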

Comprehensive Deduplication Workflow

Complete step-by-step process for deduplication, including analysis, visualization, and modeling:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from freamon.deduplication.exact_deduplication import hash_deduplication 
from freamon.deduplication.lsh_deduplication import lsh_deduplication
from freamon.data_quality.duplicates import detect_duplicates, get_duplicate_groups
from examples.deduplication_tracking_example import IndexTracker
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# For network visualization (networkx is installed as a core dependency)
import networkx as nx

# 1. Load sample data with text and duplicates
df = pd.read_csv('data_with_duplicates.csv')
print(f"Original dataset shape: {df.shape}")

# 2. Analyze duplicates using built-in detection
duplicate_stats = detect_duplicates(df)
print(f"\nDuplicate analysis:")
print(f"Exact duplicates: {duplicate_stats['duplicate_count']} records")
print(f"Duplicate percentage: {duplicate_stats['duplicate_percent']:.2f}%")

# 3. Get duplicate groups for examination
duplicate_groups = get_duplicate_groups(df)
print(f"\nFound {len(duplicate_groups)} duplicate groups")
for i, group in enumerate(duplicate_groups[:3]):  # Show first 3 groups
    print(f"\nDuplicate group {i+1}:")
    print(df.iloc[group].head(1))  # Show one example from each group

# 4. Initialize index tracker to maintain mapping
tracker = IndexTracker().initialize_from_df(df)

# 5. Find duplicates using LSH (locality-sensitive hashing) for text similarity
print("\nRunning LSH deduplication...")
kept_indices, similarity_dict = lsh_deduplication(
    df['description'],
    threshold=0.8,
    num_bands=20,
    preprocess=True,
    return_similarity_dict=True
)

# 6. Analyze LSH results
print(f"LSH kept {len(kept_indices)} out of {len(df)} records ({len(kept_indices)/len(df)*100:.1f}%)")

# 7. Visualize similarity network (for smaller datasets)
if len(df) < 1000:
    G = nx.Graph()  # networkx is already imported at the top of the script
    
    # Add all nodes (documents)
    for i in range(len(df)):
        G.add_node(i)
    
    # Add edges (similarities)
    for doc_id, similar_docs in similarity_dict.items():
        for similar_id in similar_docs:
            G.add_edge(doc_id, similar_id)
    
    # Plot network
    plt.figure(figsize=(10, 8))
    pos = nx.spring_layout(G)
    nx.draw(G, pos, node_size=50, node_color='blue', alpha=0.6)
    plt.title('Document Similarity Network')
    plt.savefig('similarity_network.png')
    plt.close()
    print("\nSaved similarity network visualization to 'similarity_network.png'")

# 8. Create deduplicated dataframe
deduped_df = df.iloc[kept_indices].copy()

# 9. Update tracker with kept indices
tracker.update_from_kept_indices(kept_indices, deduped_df)

# 10. Train model on deduplicated data
print("\nTraining model on deduplicated data...")
X = deduped_df.drop(['target', 'description'], axis=1)  # Exclude text column
y = deduped_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 11. Evaluate model
y_pred = model.predict(X_test)
print("\nModel performance on deduplicated test data:")
print(classification_report(y_test, y_pred))

# 12. Make predictions and generate results dataframe
y_pred_series = pd.Series(y_pred, index=X_test.index)
results_df = pd.DataFrame({'prediction': y_pred_series, 'actual': y_test})

# 13. Map results back to original dataset with all records
full_results = tracker.create_full_result_df(
    results_df, df, fill_value={'prediction': None, 'actual': None}
)

print(f"\nMapping results:")
print(f"Original dataset size: {len(df)}")
print(f"Deduplicated dataset size: {len(deduped_df)}")
print(f"Number of records with predictions: {full_results['prediction'].notna().sum()}")

# 14. Save full dataset with deduplication information
df['is_duplicate'] = ~df.index.isin(kept_indices)
df['has_prediction'] = full_results['prediction'].notna()
df['predicted'] = full_results['prediction']
df.to_csv('deduplication_results.csv', index=False)
print("\nSaved full dataset with deduplication and prediction information to 'deduplication_results.csv'")
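
Step 5 relies on locality-sensitive hashing (LSH) to find similar texts without comparing every pair. The sketch below illustrates the general MinHash-with-banding idea in plain Python. It is not Freamon's implementation, and the helper names (`shingles`, `minhash_signature`, `lsh_candidate_pairs`) and the choice of 100 hash functions are assumptions made for this example; only `num_bands=20` mirrors the call above.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingle_set, num_hashes=100):
    """One minimum per seeded hash; similar sets share many minima."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(texts, num_hashes=100, num_bands=20):
    """Return index pairs whose signatures agree on at least one full band."""
    rows = num_hashes // num_bands
    buckets = defaultdict(set)
    for idx, text in enumerate(texts):
        sig = minhash_signature(shingles(text), num_hashes)
        for band in range(num_bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


docs = [
    "fast wireless mouse with usb receiver and long battery life",
    "fast wireless mouse with usb receiver and long battery life",
    "stainless steel kitchen knife set for home cooks",
]
candidates = lsh_candidate_pairs(docs)
print(candidates)
```

Only documents that land in the same bucket for some band are compared further, which is what keeps the approach tractable on large collections. More bands with fewer rows each make the filter more permissive; fewer bands with more rows make it stricter.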

Duplicate Flagging for Unlabeled Data

Comprehensive workflow to identify potential duplicates without removing them, with analysis and visualization:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Required for duplicate flagging functionality 
# pip install "freamon[extended]"
from freamon.deduplication.flag_duplicates import flag_similar_records, flag_text_duplicates

# Required for PowerPoint/Excel export
# pip install "freamon[extended]"
from freamon.eda.export import export_to_excel, export_to_powerpoint

# 1. Load unlabeled dataset
unlabeled_df = pd.read_csv('unlabeled_customer_data.csv')
print(f"Dataset shape: {unlabeled_df.shape}")

# 2. Flag potential text duplicates using LSH
print("\nProcessing text duplicates...")
text_df = flag_text_duplicates(
    unlabeled_df,
    text_column='description',
    threshold=0.8,
    method='lsh',
    add_group_id=True,
    add_similarity_score=True,
    add_duplicate_flag=True
)

# 3. Analyze text duplicate results
duplicate_text_groups = text_df['duplicate_group_id'].dropna().nunique()
duplicate_text_records = text_df['is_text_duplicate'].sum()
print(f"Text duplicate analysis:")
print(f"Found {duplicate_text_groups} potential duplicate text groups")
print(f"Found {duplicate_text_records} records ({duplicate_text_records/len(text_df)*100:.1f}%) with similar text")

# 4. Flag similar records across multiple fields using weighted similarity
print("\nProcessing multi-field similarity...")
similar_df = flag_similar_records(
    text_df,  # Use the dataframe that already has text duplicate info
    columns=['name', 'address', 'phone', 'email'],
    weights={'name': 0.4, 'address': 0.3, 'phone': 0.2, 'email': 0.1},
    threshold=0.7,
    similarity_column="similarity_score",  # Column to store similarity scores
    group_column="multifield_group_id",    # Column to store group IDs
    flag_column="is_multifield_duplicate"  # Column to store duplicate flags
)

# 5. Analyze multi-field similarity results
multifield_groups = similar_df['multifield_group_id'].dropna().nunique()
multifield_duplicates = similar_df['is_multifield_duplicate'].sum()
print(f"Multi-field duplicate analysis:")
print(f"Found {multifield_groups} potential duplicate groups based on multiple fields")
print(f"Found {multifield_duplicates} records ({multifield_duplicates/len(similar_df)*100:.1f}%) with similar fields")

# 6. Create a combined duplicate flag
similar_df['is_potential_duplicate'] = similar_df['is_text_duplicate'] | similar_df['is_multifield_duplicate']
total_duplicates = similar_df['is_potential_duplicate'].sum()
print(f"\nCombined results: {total_duplicates} potential duplicates ({total_duplicates/len(similar_df)*100:.1f}%)")

# 7. Visualize similarity score distribution
plt.figure(figsize=(10, 6))
sns.histplot(similar_df['similarity_score'].dropna(), bins=20)
plt.title('Distribution of Similarity Scores')
plt.xlabel('Similarity Score')
plt.ylabel('Count')
plt.axvline(x=0.7, color='r', linestyle='--', label='Threshold (0.7)')
plt.axvline(x=0.9, color='g', linestyle='--', label='High Similarity (0.9)')
plt.legend()
plt.savefig('similarity_distribution.png')
plt.close()
print("\nSaved similarity distribution chart to 'similarity_distribution.png'")

# 8. Create a group size analysis
group_sizes = similar_df[similar_df['multifield_group_id'].notna()].groupby('multifield_group_id').size()
plt.figure(figsize=(10, 6))
sns.histplot(group_sizes, bins=10)
plt.title('Duplicate Group Size Distribution')
plt.xlabel('Group Size')
plt.ylabel('Count')
plt.savefig('group_size_distribution.png')
plt.close()
print(f"Largest duplicate group has {group_sizes.max()} records")

# 9. Add confidence level based on combined evidence
similar_df['duplicate_confidence'] = 'None'
# Both text and multifield similarity = high confidence
similar_df.loc[(similar_df['is_text_duplicate']) & 
               (similar_df['is_multifield_duplicate']), 'duplicate_confidence'] = 'High'
# Only one method but high score = medium confidence
similar_df.loc[(similar_df['is_potential_duplicate']) & 
               (similar_df['similarity_score'] > 0.9) &
               (similar_df['duplicate_confidence'] == 'None'), 'duplicate_confidence'] = 'Medium'
# Flagged but lower score = low confidence
similar_df.loc[(similar_df['is_potential_duplicate']) & 
               (similar_df['duplicate_confidence'] == 'None'), 'duplicate_confidence'] = 'Low'

confidence_counts = similar_df['duplicate_confidence'].value_counts()
print("\nDuplicate confidence levels:")
for level, count in confidence_counts.items():
    print(f"{level} confidence: {count} records")

# 10. Select high- and medium-confidence duplicates for review
high_confidence = similar_df[similar_df['duplicate_confidence'] == 'High']
medium_confidence = similar_df[similar_df['duplicate_confidence'] == 'Medium']

# 11. Create summary report with examples from each confidence level
report_data = []
for group_id in high_confidence['multifield_group_id'].dropna().unique()[:5]:  # Top 5 high confidence groups
    group_records = similar_df[similar_df['multifield_group_id'] == group_id]
    report_data.append({
        'confidence': 'High',
        'group_id': group_id,
        'group_size': len(group_records),
        'similarity_score': group_records['similarity_score'].mean(),
        'sample_records': group_records.head(2).to_dict('records')
    })

# 12. Export results to CSV files
similar_df.to_csv('duplicate_analysis_complete.csv', index=False)
high_confidence.to_csv('high_confidence_duplicates.csv', index=False)
medium_confidence.to_csv('medium_confidence_duplicates.csv', index=False)

# 13. Collect summary statistics (available for custom reporting; not exported below)
summary_data = {
    'dataframe_size': len(similar_df),
    'duplicate_count': total_duplicates,
    'duplicate_percent': total_duplicates/len(similar_df)*100,
    'confidence_distribution': confidence_counts.to_dict(),
    'group_count': multifield_groups,
    'largest_group_size': group_sizes.max(),
    'similarity_scores': similar_df['similarity_score'].dropna().tolist(),
    'threshold': 0.7
}

# Create presentation-ready dictionary
presentation_data = {
    'metrics': {
        'dataset_size': len(similar_df),
        'duplicate_count': total_duplicates,
        'duplicate_percent': total_duplicates/len(similar_df)*100,
        'high_confidence': confidence_counts.get('High', 0),
        'medium_confidence': confidence_counts.get('Medium', 0),
        'low_confidence': confidence_counts.get('Low', 0),
    }
}

# 14. Export to PowerPoint (uses the 'model' report type because it includes charts)
export_to_powerpoint(
    presentation_data, 
    'duplicate_analysis.pptx', 
    report_type='model'
)
print("\nExported reports to CSV files and PowerPoint")

print("\nDuplicate analysis complete.")
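
Step 4 combines per-field similarities into a single score using the `weights` dictionary. The similarity measure `flag_similar_records` uses for each field is not stated here, so the sketch below substitutes the standard library's `difflib.SequenceMatcher` as a stand-in. The function names and sample records are hypothetical and only illustrate how a weighted average of field similarities can be compared against a threshold.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Similarity in [0, 1] between two values, compared as lowercase strings."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def weighted_similarity(rec_a, rec_b, weights):
    """Weighted average of per-field similarities; weights are normalized."""
    total = sum(weights.values())
    return sum(
        (weight / total) * field_similarity(rec_a[col], rec_b[col])
        for col, weight in weights.items()
    )


weights = {'name': 0.4, 'address': 0.3, 'phone': 0.2, 'email': 0.1}
rec_a = {'name': 'Jon Smith', 'address': '12 Oak Street',
         'phone': '555-0100', 'email': 'jon@example.com'}
rec_b = {'name': 'John Smith', 'address': '12 Oak Street',
         'phone': '555-0100', 'email': 'john@example.com'}
rec_c = {'name': 'xyz', 'address': 'qqq', 'phone': '999', 'email': 'www'}

threshold = 0.7
for label, other in [('a vs b', rec_b), ('a vs c', rec_c)]:
    score = weighted_similarity(rec_a, other, weights)
    print(f"{label}: score={score:.3f}, flagged={score >= threshold}")
```

Because the name carries the largest weight, a small spelling difference there lowers the score more than the same difference in the email field would.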

Performance Optimization for Large-Scale Deduplication

When working with large datasets, flag_similar_records accepts chunk_size, max_comparisons, n_jobs, and use_polars so you can trade memory use and runtime against detection accuracy:

from freamon.deduplication.flag_duplicates import flag_similar_records

# For a dataset with 100,000+ records
# (large_df is a placeholder for a pandas DataFrame you have already loaded)
result_df = flag_similar_records(
    large_df,
    columns=['name', 'address', 'phone', 'email'],
    weights={'name': 0.4, 'address': 0.3, 'phone': 0.2, 'email': 0.1},
    threshold=0.85,           # Higher threshold for precision
    chunk_size=500,           # Optimize memory usage
    max_comparisons=1000000,  # Limit total comparisons
    n_jobs=4,                 # Parallel processing
    use_polars=True           # Use Polars if available
)

Chunk Size and Accuracy Tradeoffs

The chunk_size parameter creates a fundamental tradeoff between memory efficiency and detection accuracy:

Dataset Size | Recommended chunk_size | Recommended max_comparisons | Impact on Accuracy
--- | --- | --- | ---
< 20,000 rows | None (no chunking) | Default | Highest accuracy, full comparison
20,000-100,000 rows | 1000-2000 | 1,000,000-3,000,000 | Good balance of accuracy and memory usage
100,000-500,000 rows | 500-1000 | 1,000,000-5,000,000 | Some potential duplicates might be missed
> 500,000 rows | 250-500 | 500,000-1,000,000 | Focus on highest-quality matches
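
To see why chunking matters, it helps to count record pairs. The sketch below is a back-of-the-envelope calculation, not Freamon code; it assumes, for illustration only, that chunked mode compares records solely within fixed-size chunks.

```python
def full_comparisons(n_rows):
    """Unordered record pairs when every record is compared with every other."""
    return n_rows * (n_rows - 1) // 2


def within_chunk_comparisons(n_rows, chunk_size):
    """Pairs compared if records are only compared inside fixed-size chunks."""
    full_chunks, remainder = divmod(n_rows, chunk_size)
    return full_chunks * full_comparisons(chunk_size) + full_comparisons(remainder)


n = 100_000
print(f"all pairs:        {full_comparisons(n):,}")
print(f"chunk_size=1000:  {within_chunk_comparisons(n, 1000):,}")
print(f"chunk_size=500:   {within_chunk_comparisons(n, 500):,}")
```

For 100,000 rows, the full comparison involves roughly five billion pairs, while chunks of 500 reduce that to about 25 million. The pairs that are skipped are exactly where missed duplicates can come from.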

How it works:

- Smaller chunks reduce memory usage dramatically but may miss some potential duplicates
- The algorithm prioritizes within-chunk comparisons where duplicates are more likely
- Connected components analysis helps capture relationships between records even across chunks
- For critical applications, start with larger chunk sizes and decrease only if memory issues occur
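
The connected-components step mentioned above can be pictured as follows: every pair of records found to be similar is an edge, and each connected group of records becomes one duplicate group. This is a minimal union-find sketch under that assumption, with hypothetical names; it is not Freamon's implementation.

```python
def group_duplicates(num_records, similar_pairs):
    """Merge similar pairs into duplicate groups using union-find."""
    parent = list(range(num_records))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller one for stable group labels.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for record in range(num_records):
        groups.setdefault(find(record), []).append(record)
    # Records with no similar partner are not duplicates.
    return [members for members in groups.values() if len(members) > 1]


# Records 0 and 1 matched inside one chunk; records 1 and 5 matched across chunks.
pairs = [(0, 1), (1, 5), (2, 3)]
groups = group_duplicates(6, pairs)
print(groups)  # [[0, 1, 5], [2, 3]]
```

Records 0 and 5 were never compared directly, yet they end up in the same group because both are linked to record 1. This is how a single cross-chunk match can recover relationships that chunking would otherwise hide.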

For extremely large datasets, consider increasing the similarity threshold to focus on higher-quality matches:

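# For very large datasets (500k+ records)
# (very_large_df is a placeholder DataFrame; columns and weights are defined as in the previous example)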
result_df = flag_similar_records(
    very_large_df,
    columns=columns,
    weights=weights,
    threshold=0.9,            # Higher threshold
    chunk_size=250,           # Very small chunks
    max_comparisons=500000,   # Limited comparisons
    n_jobs=8                  # More parallel workers
)

Advanced EDA and Feature Selection

Perform advanced multivariate analysis and feature selection:

from freamon.eda.advanced_multivariate import visualize_pca, analyze_target_relationships
from freamon.features.categorical_selection import chi2_selection, anova_f_selection

# PCA visualization with target coloring
fig, pca_results = visualize_pca(df, target_column='target')

# Target-oriented feature analysis
figures, target_results = analyze_target_relationships(df, target_column='target')

# Select important categorical features
selected_features, scores = chi2_selection(df, target='target', k=5, return_scores=True)

See Advanced EDA documentation for more details.
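
For intuition about what chi-square selection measures, the sketch below computes the chi-square statistic of a categorical feature against a target from their contingency table and keeps the top-scoring features. It is a plain-Python illustration with hypothetical function names, not the implementation behind `chi2_selection`.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Chi-square statistic of the feature/target contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    stat = 0.0
    for f_val, f_count in feature_totals.items():
        for t_val, t_count in target_totals.items():
            # Expected count if feature and target were independent.
            expected = f_count * t_count / n
            stat += (observed[(f_val, t_val)] - expected) ** 2 / expected
    return stat


def top_k_features(columns, target, k):
    """Rank features by chi-square statistic and return the top k with all scores."""
    scores = {name: chi_square_statistic(values, target) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


target = [1, 1, 0, 0]
columns = {
    "region": ["north", "north", "south", "south"],  # perfectly aligned with target
    "color": ["red", "blue", "red", "blue"],          # independent of target
}
selected, scores = top_k_features(columns, target, k=1)
print(selected, scores)  # ['region'] {'region': 4.0, 'color': 0.0}
```

A feature whose categories line up with the target produces a large statistic, while an independent feature scores near zero. In practice the statistic is usually converted to a p-value that accounts for the table's degrees of freedom.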

EDA Module

The EDA module provides comprehensive data analysis:

from freamon.eda import EDAAnalyzer

analyzer = EDAAnalyzer(df, target_column='target')
analyzer.run_full_analysis()

# Generate different types of reports
analyzer.generate_report('report.html')  # HTML report
analyzer.generate_report('report.md', format='markdown')  # Markdown report
analyzer.generate_report('report.md', format='markdown', convert_to_html=True)  # Both formats

# For Jupyter notebooks, display interactive report
analyzer.display_eda_report()  # Interactive display in notebook

Documentation

For more detailed information, refer to the examples directory.

License

MIT License
