NeurIDA: Dynamic Modeling For Effective In-Database Analytics

Source code for the paper "NeurIDA: Dynamic Modeling For Effective In-Database Analytics", in particular the algorithmic components of Dynamic In-Database Modeling (DIME).

Dynamic In-Database Modeling

(Figure: overview of the DIME framework)

We present the DIME modeling framework, focusing on its composable base-model architecture and execution flow. When model augmentation is invoked, DIME executes a modeling pipeline tailored to the specific analytical task: it first builds a relational graph containing tuples from the target table and related tables, then dynamically constructs a bespoke model for this graph from the selected base model and shared model components, and finally uses the constructed model to generate predictions for tuples in the target table.
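The three-stage flow described above can be sketched in plain Python. This is a minimal illustration of the pipeline shape only; every name here (`build_relational_graph`, `compose_model`, `predict`) is hypothetical and does not correspond to the repository's actual API.

```python
# Hypothetical sketch of DIME's three stages; names are illustrative only.

def build_relational_graph(target_table, related_tables):
    """Stage 1: gather tuples from the target table and its related tables."""
    return {"target": target_table, "related": related_tables}

def compose_model(graph, base_encoder):
    """Stage 2: assemble a bespoke model around the chosen base encoder.

    A real implementation would also wire in shared relational components;
    here the 'model' simply applies the base encoder to each tuple.
    """
    def model(tuple_row):
        return base_encoder(tuple_row)
    return model

def predict(graph, model):
    """Stage 3: generate predictions for tuples in the target table."""
    return [model(row) for row in graph["target"]]

# Toy usage: a trivial "encoder" that sums a tuple's numeric features.
graph = build_relational_graph(target_table=[[1, 2], [3, 4]], related_tables=[])
model = compose_model(graph, base_encoder=sum)
print(predict(graph, model))  # [3, 7]
```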

Structure

neurida/
├── model/                      # Core model implementations
│   ├── aida.py                # Main AIDA framework (AIDAXFormer, relation modules, encoders)
│   ├── rdb.py                 # RDB baseline model with HeteroGraphSAGE
│   ├── base.py                # Base architecture components
│   ├── tabular/               # Tabular encoders (TabM, DeepFM, ARMNet, etc.)
│   └── layer/                 # Custom layers (fusion, relation convolution)
├── aida/                      # AIDA experiment framework
│   ├── aida_run.py           # Main training script
│   ├── prompt/               # LLM-based prompt generation
│   ├── db/                   # Database profiling utilities
│   └── run_*.sh              # Experiment shell scripts
├── utils/                     # Utility functions
│   ├── data/                 # Dataset implementations and factory
│   ├── builder.py            # Graph construction utilities
│   ├── sample.py             # Neighbor sampling
│   └── preprocess.py         # Type inference and preprocessing
├── cmds/                      # Command-line tools for baselines
├── data/                      # Data directory (download required)
└── environment.yml            # Conda environment specification

Installation and Data Setup

Environment Setup

# Clone the repository
git clone <repository-url>
cd neurida

# Create and activate conda environment
conda env create -f environment.yml
conda activate deepdb

Data Preparation

Supported Datasets:

  • H&M Fashion (hm): Fashion retail transactions
  • Avito (avito): Online classifieds platform
  • Event (event): Event attendance data
  • RateBeer (ratebeer): Beer ratings and reviews
  • OLIST (olist): Brazilian e-commerce
  • Trial (trial): Medical trial outcomes
  • Stack Overflow (stack): Developer engagement

The data/ directory contains two main types of artifacts:

  1. Tabular Data: Flattened relational data with various feature engineering levels
  2. TensorFrame Data: Materialized database graph structures stored as PyTorch tensors

Data files are excluded from git. You can download them from the official website or they will be generated automatically on first run. See data/README.md for more details.
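Since the TensorFrame caches are generated on first run when absent, a script may want to check for them up front. The sketch below is an assumption about the cache layout, inferred only from the `--tf_cache_dir data/hm-tensor-frame` argument shown later in this README; the helper is not part of the repository.

```python
# Hypothetical helper: detect whether a materialized TensorFrame cache exists.
# The "<db_name>-tensor-frame" directory naming is an assumption inferred from
# the --tf_cache_dir example in this README.
from pathlib import Path

def tensorframe_cache_ready(db_name: str, data_root: str = "data") -> bool:
    """Return True if the cache directory for db_name exists and is non-empty."""
    cache_dir = Path(data_root) / f"{db_name}-tensor-frame"
    return cache_dir.is_dir() and any(cache_dir.iterdir())

# Example: warn before a run that will trigger regeneration.
if not tensorframe_cache_ready("hm"):
    print("TensorFrame cache missing; it will be generated on first run")
```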

Script Execution

Main Experiments

Run full experiments across all datasets and encoders:

bash aida/run_aida_experiments.sh

This runs experiments with multiple base encoders (mlp, tabm, dfm, resnet, fttrans) on classification and regression tasks.

Single dataset/encoder experiments:

bash aida/run_aida_single_dataset.sh    # Test on a specific dataset
bash aida/run_aida_single_encoder.sh    # Test a specific encoder

Ablation studies:

bash aida/run_aida_ablation.sh          # Test impact of model components
bash aida/run_aida_neighbor_size.sh     # Test neighbor sampling sizes
bash aida/run_aida_relation_args.sh     # Test relation module configurations
bash aida/run_aida_base_encoder.sh      # Compare base encoders

Baseline Experiments

Machine learning baselines:

bash aida/ml_baseline.sh                # XGBoost, LightGBM, CatBoost
bash aida/sklearn_baseline.sh           # Random Forest, etc.

Neural network baselines:

bash aida/fit_best_baseline.sh          # Best DNN baseline
bash aida/fit_medium_baseline.sh        # Medium DNN baseline
bash aida/fit_low_baseline.sh           # Low DNN baseline
bash aida/tpberta_medium_baseline.sh    # TPBerta baseline

Custom Training

For custom training with specific parameters:

python -m aida.aida_run \
    --db_name hm \
    --tf_cache_dir data/hm-tensor-frame \
    --task_name user-churn \
    --base_encoder tabm \
    --channels 128 \
    --relation_layer_num 2 \
    --num_neighbors 128 128 \
    --num_epochs 500

Key arguments:

  • --db_name: Database name (hm, avito, event, trial, ratebeer, olist, stack)
  • --task_name: Task name (e.g., user-churn, item-sales, user-repeat)
  • --base_encoder: Base encoder (mlp, tabm, dfm, resnet, fttrans, armnet)
  • --deactivate_fusion_module: Disable fusion module (ablation)
  • --deactivate_relation_module: Disable relation module (ablation)
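To sweep these arguments programmatically rather than editing shell scripts, the invocation can be assembled in Python. The flag names below are taken from this README; the helper function itself is a hypothetical convenience, not part of the repository.

```python
# Hypothetical helper that builds the `python -m aida.aida_run` command line
# for a sweep over base encoders. Flags mirror the README's "Key arguments".
import shlex

def make_run_command(db_name, task_name, base_encoder, channels=128, epochs=500):
    """Return the argv list for one training run."""
    return [
        "python", "-m", "aida.aida_run",
        "--db_name", db_name,
        "--task_name", task_name,
        "--base_encoder", base_encoder,
        "--channels", str(channels),
        "--num_epochs", str(epochs),
    ]

# Example: print the command for each encoder (pass to subprocess.run to execute).
for enc in ["mlp", "tabm", "dfm", "resnet", "fttrans"]:
    print(shlex.join(make_run_command("hm", "user-churn", enc)))
```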

Experiment Results

(Figure: experiment results)
