This repository contains a Python library used to implement the methodologies and reproduce the results in Sanna Passino, F. and Heard, N. A. (2022), "Latent structure blockmodels for Bayesian spectral graph clustering", Statistics and Computing 32(2) (link, arXiv preprint).
The Python library can be installed in editable mode as follows:

```
pip install -e lib/
```
The library can then be imported in any Python session:

```python
import lsbm
```

The repository contains multiple directories:
- `lib` contains the Python library;
- `data` contains some of the datasets used in the paper;
- `notebooks` contains Jupyter notebooks with examples on how to use the library;
- `scripts` contains Python scripts to reproduce the results in the paper.
The main tool for inference on LSBMs is the MCMC sampler `lsbm_gp_gibbs`, contained in the file `lib/lsbm/lsbm_gp.py`. The class can be initialised using three objects:
- a *d*-dimensional embedding `X`, a `numpy` array of size *n* × *d*, where *n* is the number of nodes;
- the number of communities `K`, a positive integer;
- the set of kernels `csi`, a dictionary containing the *K* × *d* kernel functions for the Gaussian processes, indexed by community and embedding dimension.
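For context, the embedding `X` is typically obtained from a spectral decomposition of the observed graph. The following is a minimal sketch, not part of the library: the adjacency matrix `A` is simulated here purely for illustration, and the adjacency spectral embedding is computed directly with `numpy`:

```python
import numpy as np

# Simulate a symmetric adjacency matrix A (illustrative only; in practice
# A comes from the observed graph).
rng = np.random.default_rng(0)
n, d = 100, 2
A = rng.binomial(1, 0.1, size=(n, n))
A = np.triu(A, 1)
A = A + A.T                                   # symmetric, no self-loops

# Adjacency spectral embedding: scale the top-d eigenvectors (by absolute
# eigenvalue) by the square root of the corresponding |eigenvalues|.
vals, vecs = np.linalg.eigh(A)
idx = np.argsort(np.abs(vals))[::-1][:d]
X = vecs[:, idx] * np.sqrt(np.abs(vals[idx]))  # shape (n, d)
```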
Additionally, the file `lib/lsbm/lsbm.py` contains a simplified class `lsbm_gibbs`, which can be used for models where the kernel functions are assumed to be inner products. The latent functions in such models can be expressed as an inner product between basis functions and corresponding weights, with a joint normal-inverse-gamma (NIG) prior on the weights and the variance parameter.
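Under one common parameterisation of the NIG prior (the zero prior mean is an assumption made here for illustration; the symbols $\Delta_{k,j}$, $a_0$ and $b_0$ match the parameters used in the code below), this can be written as:

$$
w_{k,j} \mid \sigma^2_{k,j} \sim \mathcal{N}\left(0,\ \sigma^2_{k,j}\,\Delta_{k,j}\right), \qquad \sigma^2_{k,j} \sim \text{Inv-Gamma}(a_0, b_0).
$$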
The class `lsbm_gibbs` does not require a dictionary of kernel functions for initialisation, but a dictionary `W_function` containing the basis functions. The prior parameters of the NIG prior are then specified using the function `initialise` within the same class. The class uses the Zellner's *g*-prior as default.
For example, consider the model where all the community-specific curves are assumed to be quadratic in the latent position θ, passing through the origin, and linear in the first dimension (with zero intercept and unit slope). Initialising the Gibbs sampler class in `lsbm` is easy, since the basis functions are straightforward to define:
```python
import numpy as np
import lsbm

phi = {}
for k in range(K):
    phi[k,0] = lambda theta: np.array([theta])
    for j in range(1,d):
        phi[k,j] = lambda theta: np.array([theta ** 2, theta])

m = lsbm.lsbm_gibbs(X=X, K=K, W_function=phi)
```

Defining the kernels requires a bit more effort. Using the dictionary `phi` defined in the previous code snippet, the corresponding kernels, assuming prior scale matrices `Delta` for the NIG prior, are:
```python
csi = {}
for k in range(K):
    # Bind k and j as default arguments to avoid Python's late binding of
    # loop variables inside lambdas.
    csi[k,0] = lambda theta, theta_prime, k=k: Delta[k,0] * np.matmul(
        phi[k,0](theta).reshape(-1,1), phi[k,0](theta_prime).reshape(1,-1))
    for j in range(1,d):
        csi[k,j] = lambda theta, theta_prime, k=k, j=j: np.matmul(
            np.matmul(phi[k,j](theta), Delta[k,j]), np.transpose(phi[k,j](theta_prime)))

m = lsbm_gp.lsbm_gp_gibbs(X=X, K=K, csi=csi)
```

Note that the kernel functions in `csi` must be flexible enough to correctly handle four types of input, (float, float), (float, vector), (vector, float) and (vector, vector), returning an appropriate object (float, vector or matrix) in each of the four circumstances.
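One way to satisfy this requirement is to reshape the basis evaluations into column matrices before the matrix products, so the same algebra covers all four input combinations. The sketch below is illustrative only: the quadratic basis `phi`, the identity scale matrix `Delta` and the helper `kernel` are hypothetical, not the library's API.

```python
import numpy as np

# Hypothetical quadratic basis: shape (2,) for a float input,
# (2, m) for a vector input of length m.
phi = lambda theta: np.array([theta ** 2, theta])
Delta = np.eye(2)  # hypothetical prior scale matrix

def kernel(theta, theta_prime):
    a = phi(theta).reshape(2, -1)        # (2, n); n = 1 for a float input
    b = phi(theta_prime).reshape(2, -1)  # (2, m); m = 1 for a float input
    out = a.T @ Delta @ b                # (n, m) Gram matrix
    # Collapse singleton dimensions: float for (float, float) inputs,
    # vector for mixed inputs, matrix for (vector, vector) inputs.
    return out.item() if out.size == 1 else np.squeeze(out)

kernel(1.0, 2.0)                                          # float
kernel(1.0, np.array([1.0, 2.0, 3.0]))                    # vector of length 3
kernel(np.array([1.0, 2.0]), np.array([1.0, 2.0, 3.0]))   # 2 x 3 matrix
```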
Then the remaining parameters in both classes are initialised using the function `initialise`. For example, `z` could be initialised using *k*-means. If the first dimension is assumed to be linear in θ, then θ could be initialised using a noisy version of the first column of the embedding. For the same reason, the mean `mu_theta` of the prior on θ could just be set to the mean of the first column of the embedding, with a large variance `sigma_theta`. Finally, the parameters of the inverse-gamma prior on the variance component must be chosen carefully: the scale `b_0` needs to be small to avoid unwanted strong prior effects.
```python
m.initialise(z=np.random.choice(m.K, size=m.n), theta=X[:,0]+np.random.normal(size=m.n, scale=0.01),
    mu_theta=X[:,0].mean(), sigma_theta=10, a_0=1, b_0=0.01, first_linear=True)
```

In the class `lsbm_gibbs`, the function `initialise` requires an additional parameter `Lambda_0`, representing the scale parameter for the Zellner's *g*-prior scale matrix.
Finally, the MCMC sampler can be run, specifying the number of samples `samples`, the burn-in length `burn`, the variance `sigma_prop` of the proposal distribution for θ, and the integer parameter `thinning`. The MCMC sampler retains only samples indexed by multiples of `thinning`.
```python
q = m.mcmc(samples=10000, burn=1000, sigma_prop=0.5, thinning=1)
```

The figures in the paper can be reproduced using the following files in the directory `scripts`:
- Figure 1: `sim_pictures_intro.py`;
- Figure 3: `sim_pictures.py`;
- Figures 4 and 5: `sim.py`;
- Figure 6: `harry_potter.py`;
- Figures 7 and 8: `drosophila.py`.
For security reasons, the ICL network data have not been made available, but the code to run the quadratic model and the truncated power basis spline model is available in `scripts/icl.py`. Similarly, `scripts/icl_gp.py` reproduces the same models under an explicit Gaussian process setting.