Inspired by the need for flexible and efficient tools in Bayesian nonparametric clustering, this repository presents a C++ framework that leverages Markov Chain Monte Carlo (MCMC) methods. The framework is designed to be modular and extensible, allowing users to implement various stochastic processes, likelihood models, and sampling strategies.
Documentation: https://filippo-galli.github.io/BNPClust/
License: GPL-3.0. See the LICENSE file for details.
To get started quickly, clone the repository and open it in R/RStudio. Ensure you have the required R packages installed (Rcpp, RcppEigen).
git clone https://github.com/filippo-galli/BNPClust.git
cd BNPClust
Open R/mcmc_loop.R to build your own MCMC scheme and to see how data is created and passed to each component.
An example of a complete MCMC scheme is provided in R/launcher.R, which uses R/mcmc_loop.R as the underlying MCMC loop.
- R/: R interface functions and scripts for MCMC orchestration and analysis
- src/: C++ source code implementing the core framework
- docs/: Generated documentation (Doxygen)
- doxygen-theme/: Doxygen theme configuration files
Each C++ class in the framework has a corresponding R binding, so the whole workflow can be driven from R. The typical workflow is:
- R/launcher.R: Entry point to load data, set parameters, invoke MCMC functions, and save results
- R/mcmc_loop.R: The MCMC iteration loop where you select the combination of Process, Likelihood, and Sampler to use
- R/mcmc_analysis.R: Functions to visualize and summarize MCMC output
- Other scripts provide utility functions for data loading, visualization, result plotting, and data fetching/cleaning
The C++ source code is organized into the following subdirectories:
- src/processes/: Implementations of stochastic processes (e.g., Dirichlet Process, Normalized Generalized Gamma Process) and their modular extensions (continuous covariate, spatial, and binary covariate modules)
- src/likelihoods/: Likelihood model implementations, including distance-based clustering, gamma likelihood, and null likelihood
- src/samplers/: MCMC sampling algorithms (Neal's Algorithm 3, Split-Merge variants, SAMS, etc.)
- src/utils/: Utility functions, base classes, and shared infrastructure
- src/bindings.cpp: Rcpp bindings exposing C++ classes and functions to R
The docs/ folder contains documentation generated with Doxygen, including class diagrams and detailed code documentation. To regenerate the documentation, ensure Doxygen is installed and run:
doxygen Doxyfile
The framework is built around five main logical components:
- Params: Manages model hyperparameters and MCMC configuration
- Likelihood: Defines the data observation model
- Process: Handles the Bayesian nonparametric prior
- Data: Manages cluster assignments and data handling
- Sampler: Implements the MCMC inference engine
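To make the division of labour between the five components concrete, here is a minimal C++ sketch of how they could fit together. Every class and member name below is hypothetical and chosen only for illustration; the actual classes under src/ may be organised quite differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// NOTE: all names here are hypothetical; they are not the classes in src/.

struct Params {
  double total_mass = 1.0;  // assumed prior hyperparameter
  int iterations = 1000;    // assumed MCMC configuration
};

struct Data {
  std::vector<int> allocations;  // cluster label of each observation
  // Number of observations in cluster k, not counting observation `skip`.
  int count_excluding(int k, std::size_t skip) const {
    int n = 0;
    for (std::size_t j = 0; j < allocations.size(); ++j)
      if (j != skip && allocations[j] == k) ++n;
    return n;
  }
};

struct Likelihood {
  virtual ~Likelihood() = default;
  virtual double log_lik(const Data& d, std::size_t i, int k) const = 0;
};

struct Process {
  virtual ~Process() = default;
  virtual double log_prior(const Data& d, std::size_t i, int k) const = 0;
};

// Contributes nothing, so the combined weight equals the prior weight.
struct NullLikelihood : Likelihood {
  double log_lik(const Data&, std::size_t, int) const override { return 0.0; }
};

// Chinese-restaurant-style weights: cluster size, or total mass if empty.
struct CrpProcess : Process {
  explicit CrpProcess(const Params& p) : params(p) {}
  double log_prior(const Data& d, std::size_t i, int k) const override {
    const int n = d.count_excluding(k, i);
    return std::log(n > 0 ? static_cast<double>(n) : params.total_mass);
  }
  Params params;
};

// The sampler only sees the two abstract interfaces.
struct Sampler {
  Sampler(const Likelihood& l, const Process& p) : lik(l), proc(p) {}
  double log_weight(const Data& d, std::size_t i, int k) const {
    assert(i < d.allocations.size());
    return lik.log_lik(d, i, k) + proc.log_prior(d, i, k);
  }
  const Likelihood& lik;
  const Process& proc;
};
```

Because the sampler in this sketch depends only on the abstract Likelihood and Process interfaces, swapping either one does not require touching the sampling code, which is the kind of modularity the component list describes.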
Currently implemented processes:
- Dirichlet Process (DP): The foundational nonparametric prior
- Normalized Generalized Gamma Process (NGGP): A flexible family of which the DP is a special case
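The relationship between the two priors can be seen in their predictive (allocation) weights. The sketch below assumes one common NGGP parameterisation, with total mass a, discount sigma in [0, 1), tilting parameter tau, and a latent auxiliary variable u, in which an existing cluster of size n has weight proportional to n - sigma and a new cluster has weight proportional to a(u + tau)^sigma. The function names are hypothetical and the repository may use a different parameterisation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Unnormalised log predictive weights for one observation, given the sizes
// of the existing clusters (with that observation removed). The last entry
// is the weight of opening a new cluster.
// Assumed NGGP parameterisation: a > 0, 0 <= sigma < 1, tau > 0, u >= 0.
std::vector<double> nggp_log_weights(const std::vector<int>& sizes, double a,
                                     double sigma, double tau, double u) {
  assert(a > 0.0 && sigma >= 0.0 && sigma < 1.0 && tau > 0.0 && u >= 0.0);
  std::vector<double> lw;
  lw.reserve(sizes.size() + 1);
  for (int n : sizes) {
    assert(n >= 1);
    lw.push_back(std::log(n - sigma));  // existing cluster: n - sigma
  }
  lw.push_back(std::log(a) + sigma * std::log(u + tau));  // new cluster
  return lw;
}

// Dirichlet process weights: cluster size for existing, alpha for new.
std::vector<double> dp_log_weights(const std::vector<int>& sizes,
                                   double alpha) {
  assert(alpha > 0.0);
  std::vector<double> lw;
  lw.reserve(sizes.size() + 1);
  for (int n : sizes) {
    assert(n >= 1);
    lw.push_back(std::log(static_cast<double>(n)));
  }
  lw.push_back(std::log(alpha));
  return lw;
}
```

Setting sigma to 0 in this sketch makes the NGGP weights coincide with the DP weights for alpha equal to a, regardless of tau and u, which is the sense in which the DP is a special case.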
The framework supports modular extensions to incorporate domain-specific structure:
- Continuous Covariate Module: Incorporates continuous covariates into the clustering process
- Spatial Module: Accounts for spatial dependencies given a neighbor adjacency matrix
- Binary Covariate Module: Incorporates binary (two-valued) covariates into the clustering process
Cached versions of these modules are available for improved computational performance.
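As an illustration of what a module built on a neighbor adjacency matrix might compute, the sketch below counts, for one observation, how many of its neighbors currently sit in each cluster. The function names, the plain nested-vector matrix type, and the linear form of the spatial term are all assumptions made for this example and are not taken from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For observation i, count how many of its neighbours are in each cluster.
// W is assumed to be a square 0/1 adjacency matrix and allocations[j] a
// cluster label in [0, n_clusters).
std::vector<int> neighbour_counts(const std::vector<std::vector<int>>& W,
                                  const std::vector<int>& allocations,
                                  std::size_t i, int n_clusters) {
  assert(W.size() == allocations.size() && i < W.size());
  assert(W[i].size() == W.size());
  std::vector<int> counts(n_clusters, 0);
  for (std::size_t j = 0; j < W.size(); ++j) {
    if (j == i || W[i][j] == 0) continue;  // skip self and non-neighbours
    assert(allocations[j] >= 0 && allocations[j] < n_clusters);
    counts[allocations[j]] += 1;
  }
  return counts;
}

// Hypothetical spatial contribution to a log prior weight: proportional to
// the number of neighbours already in the candidate cluster.
double spatial_log_term(int neighbours_in_cluster, double strength) {
  return strength * neighbours_in_cluster;
}
```

A cached variant of such a module could store these counts and update them incrementally when a single allocation changes, rather than rescanning a row of the matrix on every evaluation.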
All likelihood components are implemented as log-likelihood functions for numerical stability. Currently available:
- Distance-based Clustering: Based on "Cohesion and Repulsion in Bayesian Distance Clustering" (Natarajan et al., 2023)
- Gamma Likelihood: Variant of distance-based clustering without the repulsion term
- Null Likelihood: A placeholder that does not contribute to the posterior, useful for prior inspection
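The reason for working in log space is that products of many small likelihood terms underflow to zero in double precision. A minimal sketch of the usual remedy, the log-sum-exp trick for turning log weights into probabilities, is shown below; the function name is hypothetical and not part of the framework's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Convert unnormalised log weights into probabilities without underflow by
// subtracting the maximum before exponentiating (log-sum-exp trick).
std::vector<double> normalize_log_weights(const std::vector<double>& lw) {
  assert(!lw.empty());
  const double m = *std::max_element(lw.begin(), lw.end());
  assert(std::isfinite(m));  // at least one weight must be non-zero
  double total = 0.0;
  for (double x : lw) total += std::exp(x - m);
  std::vector<double> probs;
  probs.reserve(lw.size());
  for (double x : lw) probs.push_back(std::exp(x - m) / total);
  return probs;
}
```

With log weights around -1000, exponentiating directly gives 0 for every entry, whereas the shifted version recovers the correct ratios. A likelihood that always returns 0 in log space, as a null likelihood would, leaves the prior log weights unchanged under this scheme.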
The framework includes several MCMC sampling strategies suited to different problem structures:
- Neal's Algorithm 3: Conjugate Gibbs sampler for efficient cluster updates
- ZDNAM: Gibbs sampling with Zero-self Downward Nested Antithetic Modification (Neal, 2024); experimental
- Split-Merge (Jain & Neal, 2004): Standard Split-Merge MCMC procedure
- SAMS: Sequentially-Allocated Merge-Split sampler (Dahl, 2021)
- LSS Split-Merge: Locality Sensitive Sampling for scalable Split-Merge (Luo et al., 2018)
- LSS-SDDS Split-Merge: Split-Merge with Smart-Dumb/Dumb-Smart moves, LSS, and SAMS enhancements
Status Legend: Core methods (Neal's Algorithm 3, standard Split-Merge) are production-ready; samplers marked experimental are under development.
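The Split-Merge family of samplers listed above proposes a move and then accepts or rejects it with a Metropolis-Hastings step. The sketch below shows that accept/reject step in log space, assuming the three log ratios have already been computed by the sampler. The struct and function names are hypothetical; they illustrate the general technique and not the code in src/samplers/.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>

// Log ratios (proposed state over current state) for one proposed move.
struct MoveLogTerms {
  double log_lik_ratio;       // likelihood of proposed vs current partition
  double log_prior_ratio;     // prior of proposed vs current partition
  double log_proposal_ratio;  // reverse vs forward proposal probability
};

// Log of the Metropolis-Hastings acceptance probability, min(1, ratio).
double log_acceptance(const MoveLogTerms& t) {
  const double log_ratio =
      t.log_lik_ratio + t.log_prior_ratio + t.log_proposal_ratio;
  assert(!std::isnan(log_ratio));
  return std::min(0.0, log_ratio);
}

// Accept the move with that probability by comparing against log(U),
// where U is uniform on [0, 1).
template <class Rng>
bool accept_move(const MoveLogTerms& t, Rng& rng) {
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  return std::log(unif(rng)) < log_acceptance(t);
}
```

Comparing log(U) with the log acceptance probability avoids exponentiating a ratio that may be extremely large or small, which is the same numerical concern that motivates log-likelihoods elsewhere in the framework.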
- R (version 3.5.0 or later)
- C++ Compiler supporting C++11 or later (e.g., g++, clang)
- R packages: Rcpp, RcppEigen
- Clone the repository:
  git clone https://github.com/filippo-galli/BNPClust.git
  cd BNPClust
- Open the project in R/RStudio
- Install dependencies:
  install.packages(c("Rcpp", "RcppEigen"))
The Rcpp bindings in src/bindings.cpp expose all C++ classes and functions to R, allowing you to compose custom MCMC schemes directly from R. Use R/launcher.R as a template:
- Configure your Process, Likelihood, and Sampler in R/mcmc_loop.R
- Prepare your data and hyperparameters in R/launcher.R
- Execute R/launcher.R to run the MCMC chain
- Analyze results using R/mcmc_analysis.R
- Natarajan, L., et al. (2023). Cohesion and Repulsion in Bayesian Distance Clustering
- Neal, R. M. (2000). Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9(2), 249–265
- Neal, R. M. (2024). Modifying Gibbs Sampling to Avoid Self Transitions
- Jain, S., & Neal, R. M. (2004). A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model. Journal of Computational and Graphical Statistics, 13(1), 158–182
- Dahl, D. B. (2021). Sequentially-allocated merge-split samplers for conjugate Bayesian nonparametric models
- Luo, L., et al. (2018). Scaling-up Split-Merge MCMC with Locality Sensitive Sampling (LSS)
BNPClust is released under the GPL-3.0 license, which means:
- Use freely: You can use BNPClust for any purpose, including commercial applications
- Modify and extend: You can modify the code to suit your needs
- Study the source: Full source code access for learning and verification
- Share improvements: If you distribute modified versions, they must also be released under GPL-3.0 with source code provided
For details on GPL-3.0 compliance and your obligations, see the LICENSE file.
See CONTRIBUTING.md for guidelines on how to contribute to BNPClust.
If you find this project useful, please leave a star!