gen_fex is a high-performance library for Probabilistic Feature Extraction and Generative Modeling, built on top of JAX. It is designed to handle high-dimensional, sparse time-series data, making it particularly effective for financial modeling in high-risk regimes.
This repository accompanies the manuscript "Generative Modeling for High-Dimensional Sparse Data: Probabilistic Feature Extraction in High-Risk Financial Regimes". It implements robust probabilistic models that outperform conventional methods in capturing non-linear, time-dependent features, especially during volatile market conditions.
- JAX-Accelerated: Leverages JAX for high-performance numerical computing and automatic differentiation.
- scikit-learn Compatible: Fully compatible with the scikit-learn API (`fit`, `transform`, `inverse_transform`), allowing seamless integration into existing ML pipelines.
- High-Dimensional Efficiency: Automatically applies the "Transpose Trick" (dual formulation) to efficiently process datasets where features ($D$) far exceed samples ($N$).
- Missing Data Imputation: Robust reconstruction of missing values in sparse datasets.
- Advanced Models:
  - PPCA (Probabilistic PCA): A probabilistic framework for PCA that handles noise and missing data.
  - PKPCA (Probabilistic Kernel PCA): Extends PPCA with kernel methods (e.g., RBF) and Wishart processes to capture non-linear structures.
- Python 3.10 or newer.
You can install the package directly from GitHub:

```bash
pip install git+https://github.com/AI-Ahmed/gen_fex.git
```

If you want to contribute or modify the code:

- Clone the repository:

  ```bash
  git clone https://github.com/AI-Ahmed/gen_fex.git
  cd gen_fex
  ```

- Install using Flit:

  ```bash
  pip install flit
  flit install --deps develop --extras test --symlink
  ```
Here is a simple example of how to use the PPCA and PKPCA classes.

```python
import numpy as np
from gen_fex import PPCA, PKPCA

# 1. Generate synthetic high-dimensional data (Samples < Features)
# Shape: (N_samples, D_features)
N, D = 100, 1000
data = np.random.rand(N, D)

# 2. Initialize Models
# We choose a latent dimension q
q = 50
ppca = PPCA(q=q)
pkpca = PKPCA(q=q)

# 3. Fit Models
# The models automatically handle the high-dimensional nature (N < D)
print("Fitting PPCA...")
ppca.fit(data, use_em=True, verbose=1)
print("Fitting PKPCA...")
pkpca.fit(data, use_em=True, verbose=1)

# 4. Transform (Dimensionality Reduction)
latent_ppca = ppca.transform()
latent_pkpca = pkpca.transform()

print(f"Original Shape: {data.shape}")
print(f"PPCA Latent Shape: {latent_ppca.shape}")    # (q, D) - Latent features
print(f"PKPCA Latent Shape: {latent_pkpca.shape}")  # (q, D) - Latent features

# Note: The model decomposes X approx W @ Z
# W: (N, q) - Sample embeddings
# Z: (q, D) - Latent features (returned by transform)

# 5. Reconstruction (Inverse Transform)
recon_ppca = ppca.inverse_transform(latent_ppca)
print(f"Reconstructed Shape: {recon_ppca.shape}")
```

PPCA in the matrix-variate setting models the observed data matrix as

$$X = WZ + \mu + E,$$
where:

- $W \in \mathbb{R}^{N \times q}$ is the loading matrix,
- $Z \in \mathbb{R}^{q \times D}$ is the matrix of latent variables,
- $\mu \in \mathbb{R}^{N \times D}$ is the mean matrix,
- $E \in \mathbb{R}^{N \times D}$ is the noise matrix.

The latent variables follow an isotropic Gaussian:

$$Z \sim \mathcal{MN}(0, I_q, I_D),$$

and the noise is modeled using a matrix-variate Gaussian distribution:

$$E \sim \mathcal{MN}(0, \sigma^2 I_N, I_D).$$

For high-dimensional data where the number of features $D$ far exceeds the number of samples $N$, the implementation switches to the dual (transpose) formulation, so that computations scale with $N$ rather than $D$.
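As a sanity check on the shapes above, here is a hypothetical forward simulation of the matrix-variate model $X = WZ + \mu + E$ in plain numpy. The noise level `sigma` is an assumed parameter, and the mean matrix is set to zero for simplicity; this is an illustration of the generative model, not library code:

```python
import numpy as np

rng = np.random.default_rng(42)
N, D, q = 100, 1000, 50
sigma = 0.1  # assumed noise scale for illustration

# With identity row and column covariances, a matrix-variate Gaussian
# has i.i.d. entries, so standard normal draws suffice.
W = rng.standard_normal((N, q))          # loading matrix (sample embeddings)
Z = rng.standard_normal((q, D))          # latent features, MN(0, I_q, I_D)
mu = np.zeros((N, D))                    # mean matrix (zero here)
E = sigma * rng.standard_normal((N, D))  # noise, MN(0, sigma^2 I_N, I_D)

X = W @ Z + mu + E                       # observed data matrix
print(X.shape)  # (100, 1000)
```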
PKPCA extends this by mapping data into a non-linear feature space using a kernel function (e.g., RBF). Our implementation utilizes a Wishart Process prior for the covariance matrix, allowing for robust uncertainty quantification in the kernel space.
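For intuition about the kernel step, an RBF (Gaussian) kernel matrix over the samples can be computed as follows. This is a generic sketch with an assumed bandwidth parameter `gamma`, not the library's kernel implementation:

```python
import numpy as np

def rbf_kernel(X, gamma=0.1):
    """Compute K[i, j] = exp(-gamma * ||x_i - x_j||^2) for rows of X."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clamp tiny negatives

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))  # N = 100 samples, D = 1000 features
K = rbf_kernel(X)
print(K.shape)  # (100, 100)
```

Note that the kernel matrix is $N \times N$ regardless of $D$, which is why kernel methods pair naturally with the dual formulation used for high-dimensional data.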
We evaluate our models on high-dimensional sparse financial data. Below are comparisons of model performance and reconstruction quality.
Fig. 2. Comparison of the negative log-likelihood between high-dimensional PPCA and PKPCA over 20 iterations (T4 GPU).
Fig. 8. Monthly reconstructed correlation between PPCA and PKPCA for assets in the R2 regime. PKPCA shows a clear divergence from PPCA, particularly between the IT and Materials sectors, reflecting sector-specific performance during this period.
```
.
├── gen_fex/           # Source code
│   ├── _ppcax.py      # PPCA implementation
│   └── _pkpcax.py     # PKPCA implementation
├── tests/             # Unit tests
├── pyproject.toml     # Project configuration
└── README.md          # Documentation
```
To ensure everything is working correctly, run the test suite:

```bash
pytest tests/test.py
```

Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository.
- Create your feature branch (`git checkout -b feature/AmazingFeature`).
- Commit your changes (`git commit -m 'Add some AmazingFeature'`).
- Push to the branch (`git push origin feature/AmazingFeature`).
- Open a Pull Request.
This project is licensed under the Apache License 2.0.
If you use this software in your research, please cite our manuscript:
```bibtex
@article{ATWA2026113376,
  title    = {Generative modeling for high-dimensional sparse data: Probabilistic feature extraction in high-risk financial regimes},
  journal  = {Engineering Applications of Artificial Intelligence},
  volume   = {164},
  pages    = {113376},
  year     = {2026},
  issn     = {0952-1976},
  doi      = {10.1016/j.engappai.2025.113376},
  url      = {https://www.sciencedirect.com/science/article/pii/S0952197625034074},
  author   = {Ahmed Nabil Atwa and Mohamed Kholief and Ahmed Sedky},
  keywords = {Probabilistic principal component analysis, Probabilistic kernel principal component analysis, Wishart process, Missing value imputation, Information-driven bars, Hierarchical risk parity}
}
```