Skip to content

AI-Ahmed/gen_fex

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

74 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Probabilistic Feature Extraction in JAX

Python 3.10+ License JAX Code Style: Black

Overview

gen_fex is a high-performance library for Probabilistic Feature Extraction and Generative Modeling, built on top of JAX. It is designed to handle high-dimensional, sparse time-series data, making it particularly effective for financial modeling in high-risk regimes.

This repository accompanies the manuscript "Generative Modeling for High-Dimensional Sparse Data: Probabilistic Feature Extraction in High-Risk Financial Regimes". It implements robust probabilistic models that outperform conventional methods in capturing non-linear, time-dependent features, especially during volatile market conditions.

โœจ Key Features

  • ๐Ÿš€ JAX-Accelerated: Leverages JAX for high-performance numerical computing and automatic differentiation.
  • scikit-learn Compatible: Fully compatible with the scikit-learn API (fit, transform, inverse_transform), allowing seamless integration into existing ML pipelines.
  • High-Dimensional Efficiency: Automatically handles the "Transpose Trick" (Dual formulation) to efficiently process datasets where features ($D$) far exceed samples ($N$).
  • Missing Data Imputation: Robust reconstruction of missing values in sparse datasets.
  • Advanced Models:
    • PPCA (Probabilistic PCA): A probabilistic framework for PCA that handles noise and missing data.
    • PKPCA (Probabilistic Kernel PCA): Extends PPCA with kernel methods (e.g., RBF) and Wishart processes to capture non-linear structures.

๐Ÿ› ๏ธ Installation

Prerequisites

  • Python 3.10 or newer.

Install via pip

You can install the package directly from GitHub:

pip install git+https://github.com/AI-Ahmed/gen_fex.git

Development Installation

If you want to contribute or modify the code:

  1. Clone the repository:

    git clone https://github.com/AI-Ahmed/gen_fex.git
    cd gen_fex
  2. Install using Flit:

    pip install flit
    flit install --deps develop --extras test --symlink

๐Ÿš€ Quick Start

Here is a simple example of how to use the PPCA and PKPCA classes.

import numpy as np
from gen_fex import PPCA, PKPCA

# 1. Generate synthetic high-dimensional data (Samples < Features)
# Shape: (N_samples, D_features)
N, D = 100, 1000
data = np.random.rand(N, D)

# 2. Initialize Models
# We choose a latent dimension q
q = 50
ppca = PPCA(q=q)
pkpca = PKPCA(q=q)

# 3. Fit Models
# The models automatically handle the high-dimensional nature (N < D)
print("Fitting PPCA...")
ppca.fit(data, use_em=True, verbose=1)

print("Fitting PKPCA...")
pkpca.fit(data, use_em=True, verbose=1)

# 4. Transform (Dimensionality Reduction)
latent_ppca = ppca.transform()
latent_pkpca = pkpca.transform()

print(f"Original Shape: {data.shape}")
print(f"PPCA Latent Shape: {latent_ppca.shape}")   # (q, D) - Latent features
print(f"PKPCA Latent Shape: {latent_pkpca.shape}") # (q, D) - Latent features

# Note: The model decomposes X approx W @ Z
# W: (N, q) - Sample embeddings
# Z: (q, D) - Latent features (returned by transform)

# 5. Reconstruction (Inverse Transform)
recon_ppca = ppca.inverse_transform(latent_ppca)
print(f"Reconstructed Shape: {recon_ppca.shape}")

๐Ÿงฎ Mathematical Background

Probabilistic PCA (PPCA)

PPCA in the matrix-variate setting models the observed data matrix $P โˆˆ โ„^{Nร—D}$ using latent variables $Z โˆˆ โ„^{qร—N}$ with a linear Gaussian generative structure:

$$ P = WZ + \mu + E $$

where:

  • $W \in \mathbb{R}^{N \times q}$ is the loading matrix,
  • $\mu \in \mathbb{R}^{N \times D}$ is the mean matrix,
  • $E \in \mathbb{R}^{N \times D}$ is the noise matrix.

The latent variables follow an isotropic Gaussian:

$$ Z \sim \mathcal{N}(0, I_q) $$

and the noise is modeled using a matrix-variate Gaussian distribution:

$$ E \sim \mathcal{N}_{N \times D} ( 0,\ \sigma^2 I_N,\ I_D ). $$

Dual Formulation & The Transpose Trick

For high-dimensional data where the number of features $D$ is much larger than the number of samples $N$ ($D \gg N$), standard PCA is computationally expensive ( $O(D^3)$ ). gen_fex implements the Dual PPCA formulation (often called the "Transpose Trick"), which operates on the Nร—N Gram matrix instead of the Dร—D covariance matrix, significantly reducing computational cost to $O(N^3)$.

Probabilistic Kernel PCA (PKPCA)

PKPCA extends this by mapping data into a non-linear feature space using a kernel function (e.g., RBF). Our implementation utilizes a Wishart Process prior for the covariance matrix, allowing for robust uncertainty quantification in the kernel space.

๐Ÿ“Š Results & Performance

We evaluate our models on high-dimensional sparse financial data. Below are comparisons of model performance and reconstruction quality.

Model Performance Comparison

Model Performance Comparison Fig. 2. Comparison of the negative log-likelihood (โ„“) between high-dimensional PPCA and PKPCA over 20 iterations (T4 GPU).

Reconstructed Data Comparison

Reconstructed Data Comparison Fig. 8. Monthly reconstructed correlation between PPCA and PKPCA for assets in the R2 regime. PKPCA shows a clear divergence from PPCA, particularly between the IT and Materials sectors, reflecting sector-specific performance during this period.

๐Ÿ“ Directory Structure

.
โ”œโ”€โ”€ gen_fex/            # Source code
โ”‚   โ”œโ”€โ”€ _ppcax.py       # PPCA implementation
โ”‚   โ””โ”€โ”€ _pkpcax.py      # PKPCA implementation
โ”œโ”€โ”€ tests/              # Unit tests
โ”œโ”€โ”€ pyproject.toml      # Project configuration
โ””โ”€โ”€ README.md           # Documentation

๐Ÿงช Running Tests

To ensure everything is working correctly, run the test suite:

pytest tests/test.py

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository.
  2. Create your feature branch (git checkout -b feature/AmazingFeature).
  3. Commit your changes (git commit -m 'Add some AmazingFeature').
  4. Push to the branch (git push origin feature/AmazingFeature).
  5. Open a Pull Request.

๐Ÿ“„ License

This project is licensed under the Apache License 2.0.

๐Ÿ“ฃ Citation

If you use this software in your research, please cite our manuscript:

@article{ATWA2026113376,
title = {Generative modeling for high-dimensional sparse data: Probabilistic feature extraction in high-risk financial regimes},
journal = {Engineering Applications of Artificial Intelligence},
volume = {164},
pages = {113376},
year = {2026},
issn = {0952-1976},
doi = {https://doi.org/10.1016/j.engappai.2025.113376},
url = {https://www.sciencedirect.com/science/article/pii/S0952197625034074},
author = {Ahmed Nabil Atwa and Mohamed Kholief and Ahmed Sedky},
keywords = {Probabilistic principal component analysis, Probabilistic kernel principal component analysis, Wishart process, Missing value imputation, Information-driven bars, Hierarchical risk parity}
}