A Python package for unsupervised and semi-supervised analysis of multivariate Bernoulli data using Bernoulli Mixture Models (BMMs). This package provides implementations for both frequentist (EM algorithm) and Bayesian (Gibbs sampling) estimation methods.
Bernoulli Mixture Models are probabilistic models for clustering binary (multivariate Bernoulli) data. They are particularly useful for:
- Clustering binary feature vectors
- Document classification with binary bag-of-words representations
- Semi-supervised text classification
- Any application involving multivariate binary data with latent group structure
Key features:

- Maximum Likelihood Estimation via the Expectation-Maximization (EM) algorithm
- Bayesian Estimation via Gibbs sampling with conjugate priors
- Semi-supervised Learning capabilities for text classification
- Naive Bayes classifier implementation for text data
- Vectorized implementations for efficient computation
- Numerical stability through log-sum-exp trick
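The log-sum-exp trick mentioned above avoids underflow when normalizing per-component log-probabilities, which for binary data with many dimensions are often very negative. A minimal illustration (not the package's internal code):

```python
import numpy as np

def log_normalize(log_w):
    """Normalize unnormalized log-weights to probabilities via log-sum-exp."""
    m = np.max(log_w, axis=-1, keepdims=True)  # subtract the max for stability
    log_norm = m + np.log(np.sum(np.exp(log_w - m), axis=-1, keepdims=True))
    return np.exp(log_w - log_norm)

# Naive exponentiation of these log-weights underflows to 0/0, while the
# shifted version recovers well-defined probabilities summing to 1.
log_w = np.array([-1000.0, -1001.0, -1002.0])
resp = log_normalize(log_w)
print(resp)
```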
Install the dependencies:

```bash
pip install -r requirements.txt
```

Or install the package in development mode:

```bash
pip install -e .
```

Requirements:

- Python 3.8+
- numpy >= 1.24.1
- pandas >= 1.5.3
- scikit-learn >= 1.2.1
- matplotlib
- statsmodels >= 0.13.5
- seaborn
- tqdm
Quick start:

```python
from bernmix.utils import bmm_utils as bmm
import numpy as np

# Generate synthetic data from a BMM
N = 1000  # sample size
K = 3     # number of mixture components
D = 20    # number of binary features

# Define true parameters
p_true = np.array([0.3, 0.5, 0.2])                  # mixture weights
theta_true = np.random.beta(0.7, 0.9, size=(D, K))  # success probabilities

# Sample from the mixture
X, Z = bmm.sample_bmm(N, p_true, theta_true)

# Fit using EM algorithm
p_0 = np.random.dirichlet(np.ones(K))        # initial mixture weights
theta_0 = np.random.beta(1, 1, size=(D, K))  # initial success probabilities
logli, p_em, theta_em, latent_em = bmm.mixture_EM(
    X=X, p_0=p_0, theta_0=theta_0, n_iter=500, stopcrit=1e-3
)
```

Project structure:

```
bmm_mix/
├── README.md                            # This file
├── requirements.txt                     # Package dependencies
├── setup.py                             # Package installation script
└── bernmix/                             # Main package directory
    ├── __init__.py
    ├── BernMix.py                       # Demo script for BMM estimation
    ├── semi_supervised_text_classify.py # Semi-supervised classification
    ├── notebooks/                       # Jupyter notebooks with tutorials
    │   ├── EM_for_BMM.ipynb             # EM algorithm tutorial
    │   └── Gibbs_for_BMM.ipynb          # Gibbs sampling tutorial
    └── utils/                           # Utility modules
        ├── __init__.py
        ├── bmm_utils.py                 # Core BMM algorithms (EM, Gibbs)
        ├── gibbs_bmm.py                 # Gibbs sampler development/scratch
        └── utils.py                     # Helper functions and Naive Bayes
```
| File | Description |
|---|---|
| `bernmix/BernMix.py` | Main demonstration script showing both EM and Gibbs sampling estimation on synthetic data |
| `bernmix/semi_supervised_text_classify.py` | Semi-supervised text classification using BMM and Naive Bayes |
| File | Description |
|---|---|
| `bernmix/utils/bmm_utils.py` | Core algorithms: `sample_bmm()` for data generation, `E_step()` / `M_step()` for EM, `mixture_EM()` for full EM fitting, `gibbs_pass()` for Gibbs sampling, `loglike()` for likelihood computation |
| `bernmix/utils/utils.py` | Helper utilities: `train_test_split_extend()` for data splitting, the `naiveBayes` class for Bernoulli Naive Bayes, `make_nb_feat()` for text feature extraction |
| `bernmix/utils/gibbs_bmm.py` | Development script for the Gibbs sampler implementation |
| Notebook | Description |
|---|---|
| `notebooks/EM_for_BMM.ipynb` | Tutorial on fitting BMMs using the EM algorithm (Bishop, 2006) |
| `notebooks/Gibbs_for_BMM.ipynb` | Tutorial on Bayesian estimation using Gibbs sampling |
The EM algorithm iteratively maximizes the incomplete-data log-likelihood by alternating two steps:
- E-step: Compute posterior probabilities of cluster assignments given current parameters
- M-step: Update mixture weights and success probabilities given posterior assignments
Reference: Bishop (2006), Pattern Recognition and Machine Learning, Chapter 9.
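The two steps above can be sketched in vectorized NumPy. This is an illustrative sketch of the update equations, not necessarily matching the signatures of the package's `E_step()` / `M_step()`; shapes follow the quick-start convention (`X` is `(N, D)`, `theta` is `(D, K)`):

```python
import numpy as np

def e_step(X, p, theta):
    """Responsibilities r[n, k] = P(z_n = k | x_n), computed in log space."""
    # log P(x_n | k) = sum_d [x_nd log theta_dk + (1 - x_nd) log(1 - theta_dk)]
    log_lik = X @ np.log(theta) + (1 - X) @ np.log(1 - theta)  # (N, K)
    log_r = np.log(p) + log_lik
    log_r -= log_r.max(axis=1, keepdims=True)                  # log-sum-exp shift
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def m_step(X, r):
    """Update weights and success probabilities from responsibilities."""
    Nk = r.sum(axis=0)       # effective counts per component
    p = Nk / X.shape[0]      # new mixture weights
    theta = (X.T @ r) / Nk   # (D, K) new success probabilities
    return p, theta

# One EM iteration on toy data
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5)).astype(float)
p = np.ones(2) / 2
theta = np.clip(rng.random((5, 2)), 0.1, 0.9)
r = e_step(X, p, theta)
p, theta = m_step(X, theta := theta, r=r) if False else m_step(X, r)
```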
The Gibbs sampler uses conjugate priors:
- Mixture weights: Dirichlet prior
- Success probabilities: Beta priors
The sampler alternates between:
- Drawing latent cluster assignments from categorical distributions
- Drawing mixture weights from Dirichlet full conditional
- Drawing success probabilities from Beta full conditionals
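One full sweep of these three draws can be sketched as follows. This is an illustrative implementation of the sampler described above, not the package's `gibbs_pass()`; the hyperparameter shapes (`alphas` of length `K`, `gammas`/`deltas` of shape `(D, K)`) are assumptions:

```python
import numpy as np

def gibbs_pass_sketch(p, theta, X, alphas, gammas, deltas, rng):
    """One Gibbs sweep: p is (K,), theta is (D, K), X is (N, D) binary data."""
    N, D = X.shape
    K = p.shape[0]

    # 1) Draw latent assignments z_n from their categorical full conditionals
    log_lik = X @ np.log(theta) + (1 - X) @ np.log(1 - theta)  # (N, K)
    log_post = np.log(p) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)            # stability shift
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=post[n]) for n in range(N)])

    # 2) Draw mixture weights from the Dirichlet full conditional
    counts = np.bincount(z, minlength=K)
    p_new = rng.dirichlet(alphas + counts)

    # 3) Draw success probabilities from Beta full conditionals
    theta_new = np.empty_like(theta)
    for k in range(K):
        ones = X[z == k].sum(axis=0)  # per-dimension success counts in cluster k
        theta_new[:, k] = rng.beta(gammas[:, k] + ones,
                                   deltas[:, k] + counts[k] - ones)
    return z, p_new, theta_new

# Tiny demonstration with uniform priors
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(50, 4)).astype(float)
z, p_new, theta_new = gibbs_pass_sketch(
    np.ones(3) / 3, np.full((4, 3), 0.5), X,
    np.ones(3), np.ones((4, 3)), np.ones((4, 3)), rng
)
```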
Fitting with the EM algorithm and plotting convergence:

```python
from bernmix.utils import bmm_utils as bmm
import matplotlib.pyplot as plt

# Fit model
logli, p_em, theta_em, latent_em = bmm.mixture_EM(
    X=X, p_0=p_0, theta_0=theta_0,
    n_iter=500, stopcrit=1e-3, verbose=True
)

# Plot convergence (skip the first few iterations for readability)
plt.plot(logli[5:])
plt.xlabel('Iteration')
plt.ylabel('Log-likelihood')
plt.title('EM Convergence')
plt.show()
```

Bayesian estimation with the Gibbs sampler:

```python
from bernmix.utils import bmm_utils as bmm
import numpy as np

MC = 2000      # Monte Carlo iterations
burn_in = 500  # burn-in period

# Prior hyperparameters (uninformative choices)
alphas = np.ones(K)       # Dirichlet prior on the mixture weights
gammas = np.ones((D, K))  # Beta prior 'success' counts
deltas = np.ones((D, K))  # Beta prior 'failure' counts

# Initialize storage and draw the initial state from the priors
p_draws = np.empty((MC, K))
theta_draws = np.empty((MC, D, K))
latent_draws = np.empty((MC, N))
p_draws[0, :] = np.random.dirichlet(alphas)
theta_draws[0, :, :] = np.random.beta(gammas, deltas)

# Run Gibbs sampler
for i in range(1, MC):
    latent_draws[i, :], p_draws[i, :], theta_draws[i, :, :] = bmm.gibbs_pass(
        p_draws[i-1, :], theta_draws[i-1, :, :], X,
        alphas=alphas, hyper_para={'gammas': gammas, 'deltas': deltas}
    )

# Posterior means (Bayes estimates), discarding the burn-in draws
p_bayes = np.mean(p_draws[burn_in:], axis=0)
theta_bayes = np.mean(theta_draws[burn_in:], axis=0)
```

Training the Bernoulli Naive Bayes classifier on text data:

```python
from bernmix.utils import utils

# Initialize and fit
nb = utils.naiveBayes()
class_prior_prob, class_cond_prob = nb.fit(corpus, labels)
```

A Bernoulli mixture model defines the probability of a binary vector $\mathbf{x} \in \{0, 1\}^D$ as

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \prod_{d=1}^{D} \theta_{dk}^{x_d} \, (1 - \theta_{dk})^{1 - x_d}$$

where:

- $\pi_k$ are the mixture weights ($\sum_k \pi_k = 1$)
- $\theta_{dk}$ is the success probability for dimension $d$ in component $k$
- $x_d \in \{0, 1\}$ is the $d$-th binary feature
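The mixture density can be evaluated directly from these definitions. A small worked example with illustrative parameter values (K = 2 components, D = 3 features):

```python
import numpy as np

# Illustrative parameters: theta[d, k] is the success probability
pi = np.array([0.6, 0.4])         # mixture weights
theta = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.7, 0.3]])
x = np.array([1, 1, 0])           # one binary observation

# p(x) = sum_k pi_k prod_d theta_dk^{x_d} (1 - theta_dk)^{1 - x_d}
per_comp = np.prod(theta.T ** x * (1 - theta.T) ** (1 - x), axis=1)  # (K,)
p_x = np.dot(pi, per_comp)
print(p_x)  # ≈ 0.1352 = 0.6 * 0.216 + 0.4 * 0.014
```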
This project is licensed under the MIT License - see the LICENSE.md file for details.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.