Skip to content
/ LDTM Public

Longitudinal Dirichlet Tree Multinomial Model

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md
Notifications You must be signed in to change notification settings

liux4283/LDTM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TEST

LDTM

The goal of LDTM is to …

Installation

You can install the development version of LDTM from GitHub with:

# install.packages("devtools")
devtools::install_github("Goodgolden/LDTM")

Example

This is a basic example which shows you how to solve a common problem:

library(LDTM)
#> Loading required package: tidyverse
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> âś” ggplot2 3.3.5     âś” purrr   0.3.4
#> âś” tibble  3.1.6     âś” dplyr   1.0.8
#> âś” tidyr   1.2.0     âś” stringr 1.4.0
#> âś” readr   2.1.2     âś” forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> âś– dplyr::filter() masks stats::filter()
#> âś– dplyr::lag()    masks stats::lag()
#> Welcome to my package
## basic example code

Introduction

OTU operational taxonomic units

microbial genomic sequencess clustered by sequence similarity

partition sequences into discrete groups instead of traditional taxonomic units.

the most abundant sequence in an OTU is the representative sequence

representative sequences from all the OTUs are used to construct a phylogenetic tree among all the OTUs

Microbial community information == OTUs + counts + phylogenetic relationship + taxonomy

Human gut microbiome study in University of Pennsylvania

the effect of diet on gut microbiome composition

  • gut microbiome data
  • nutrient intake data

distance based analysis

  • identify a few gut microbiome associated nutrients

  • unable to provide information on how dietary nutrient affect bacterial taxa

Goal

identify both the key nutrients as well as the taxa the nutrient affect

Chen and Li (2013) adopted a regression-based approach

OTU abundance data as multivariate count responses, and nutrient as covaraite

Model

Multinomial logistic regression

Y = (Y_1, ..., Y_K)^T

y = (y_1, ..., y_K)^T

\sum_{k=1}^K Y_k = \sum_{k=1}^K y_k

p = (p_1, ..., p_K)^T, \sum_{k=1} ^K p_k =1

f_M(y;\ p) = \frac {\Gamma ({\sum_{k=1}^K y_k + 1})} {\prod_{k=1}^K \Gamma ({y_k + 1})} \prod_{k=1}^Kp_k^{y_k}

The link function is a multinomial-Poisson transformation

might need to use Poissonization to simulate the data

Overdisperson

  • the multinomial distribution is not appropriate

  • p is a random variable with some prior distribution

  • \Phi^d is a (d-1) dimensional simplex,

  • the support of p is \Phi^{K-1}

  • Dirichlet distribution density at u = (u_1, ..., u_K) \in \Phi^{K-1}

  • Dirichlet is a conjugate prior to multinomial distribution

  • posterior is Dirichlet multionmial distribution, aka the Dirichlet compound multinomial distribution.

a = (a_1, ..., a_K)^T,\ a_i > 0

f_D (u;a) = \frac {\Gamma ({\sum_{k=1}^K \alpha_k})} {\prod_{k=1}^K \Gamma ({\sum_{k=1}^K \alpha_k})} \prod_{k=1}^K u_k^{\alpha_k - 1} f_{DM}(y; \alpha) = \int_{u \in \Phi^{K-1}} f_M(y; u) f_D(u; \alpha)

= \frac {\Gamma ({\sum_{k=1}^K y_k + 1}) \Gamma ({\sum_{k=1}^K \alpha_k})} {\Gamma ({\sum_{k=1}^K y_k} + {\sum_{k=1}^K \alpha_k})} \prod_{k=1}^K \frac {\Gamma (y_k + \alpha_k)} {\Gamma ({y_k + 1}) \Gamma ({\alpha_k})}

Limitation

  • all components must share a common variance parameter

  • components are mutually independent, up to the constraint that must sum up to 1

  • distribution fails to take into account the special and inherent property of microbiome count data (evolutionary relationships in the phylogenetic tree)

Generalization of Dirichlet Multinomial distribution

  • the relationships among the components of the count vector can be represented as a tree, node-by-node.

    • each component has a independent variance

    • components are correlated at subtree levels

  • a regression model with the effects of covariates

  • a regularized methods for selecting covarites (nutrients) that are associated with the count responses (OTUs)

Logistic Normal Multinomial distribution

  • Billheimer with Aitchison’s logistic normal distribution instead of Dirichlet

    • covaraites to the count vector (Billheimer et al. 2001)
    • link dietary nutrients with bacteria counts (Xia et al. 2013)
    • cannot exploit the tree structure information, logistic normal multinomial does not have a closed-form expression.

Dirichlet-Tree Multinomial Distributions

  • total number of counts, determined by the sequence depth, as an ancillary statistics

    • analysis conditioning on this number
  • A Tree T representing the hierarchical structure over the count responses

    • we will incorporate the tree structural information into the modeling

    • \mathcal L the set of leaf nodes

    • \mathcal V the set of internal nodes of T

    • for each v \in \mathcal V, \mathcal C_v the child nodes of v

    • for each v and c \in \mathcal C_v, \delta_{vc} (l) = 1(v\rightarrow c, l \in \mathcal L)

  • \mathcal L = \{ 1, ..., K\}

  • y_{vc} = \sum_{l\in\mathcal L} \delta_{vc} (l) \ y_l

  • p_{vc} = \sum_{l\in \mathcal L} \delta_{vc} (l) \ p_l

  • c \in \mathcal C_v is the index for the subtree counts and probabilities

  • Let v_0 to be the root node of T.

  • each leaf node is (v_0 \equiv v_0^l )\rightarrow v_1^l \rightarrow ... \rightarrow v_{D_l -1}^l \rightarrow (v_{D_l}^l \equiv l)

    • the path from v_0 to l, where v_d^l \in \mathcal V, \forall d = 0, ..., D_{l-1}
    • D_l \geq 1 is the inner branches in this path
    • b_{vc} = \frac {p_{vc}} {\sum_{S \in \mathcal C_v} p_{vS}}, with the \sum_{c \in \mathcal C_v} b_{vc} = 1 that each branch as a probability (a conditional probability given we know where the branch started).

p_{l} = b_{v_0 v_1^l} \times b_{v_1^l v_2^l} \times ... \times b_{v_{D_l - 1}^l} = \prod_{v\in \mathcal V} \prod_{c \in \mathcal C} b_{vc}^{\delta^{vc}(l)}

  • for p_l is the product of branch probabilities from root v_0 to final leaf l

  • this is for the leaf level

f_M(y; p) = f_M(y; b_v, v\in \mathcal V) = \prod_{v \in \mathcal V} \frac {\Gamma (\sum_{c \in \mathcal C_v} y_{vc} + 1)} {\prod_{c \in \mathcal C_v} \Gamma ({y_{vc} + 1})} \prod_{c \in \mathcal C_v} b_{vc}^{y_vc}

b_v = \{b_{vc}, c\in \mathcal C_v\}

  • this is for the internal nodes b_v, v \in \mathcal V the joint density of (b_v, v \in \mathcal V) has the form of:

f_{DT} (u_v; \alpha_v) = \prod_{v\in \mathcal V} f_D(u_v; \alpha_v) = \prod_{v \in \mathcal V} \frac {\Gamma (\sum_{c \in \mathcal C_v} \alpha_{vc})} {\prod_{c \in \mathcal C_v} \Gamma ({\alpha_{vc}})} \prod_{c \in \mathcal C_v} u_{vc}^{\alpha_vc - 1}

  • u_v = (u_{vc}, c\in \mathcal C_v) \in \Phi^{K_v -1}

  • K is the number of children of v

  • \alpha_v = (\alpha_vc > 0, c \in \mathcal C_v)

f_{DTM} (y; \alpha_v, v\in\mathcal V)

= \prod_{v\in\mathcal V} \int_{u_v \in \Phi^{K_{v}-1}} f_M(y; u_v) f_{D}(u_v; \alpha_v) du_v

= \prod_{v\in\mathcal V} \frac {\Gamma ({\sum_{c\in \mathcal C_v} y_{vc} + 1}) \Gamma ({\sum_{c\in \mathcal C_v} \alpha_{vc}})} {\Gamma ({\sum_{c \in \mathcal C_v} y_{vc} } + {\sum_{c\in \mathcal C_v}\alpha_{vc}})} \prod_{c\in \mathcal C_v} \frac {\Gamma (y_{vc} + \alpha_{vc})} {\Gamma ({y_{vc} + 1}) \Gamma ({\alpha_{vc}})}

  • that each component in the product \prod_{v \in \mathcal V} corresponds to a interior node in the tree

  • the Dirichlet multinomial distribution based on the accumulated counts along branches of given node.

E[p_l] = \prod_{v\in\mathcal V} \prod_{c\in\mathcal C_v}\bigg(E[b_{vc}] \bigg)^{\delta_{vc}(l)} = \prod_{v\in\mathcal V} \prod_{c\in\mathcal C_v}\bigg( \frac {\alpha_{vc}} {\sum_{c\in \mathcal C_v} \alpha_{vc}} \bigg)^{\delta_{vc}(l)}

  • given \sum_{k=1}^K Y_k = \sum_{k=1}^K y_k; \ E[Y_l] = E[p_l] \sum_{k=1} ^Ky_k

    1. when \mathcal V \neq \{v_0\}, each p_l has an independent variance
    2. two components p_{v_i}, p_{v_j} with in the same subtree indexed by v \in \mathcal V are correlated, due to the same ancestor of v.

Dirichlet Tree Multinomial Regression Model

  • a regression of Y = (y_1, ..., y_K)^T on p covariates X_1, ... X_p

  • for each v \in \mathcal V and c \in \mathcal C_v we express \log(\alpha_{vc}) as a linear combination of the covariates.

    \log(\alpha_{vc}) = X^T \beta_{vc} \ \ \ \ (2.a)

    \log(\alpha_{(i)vc}) = X_i^T \beta_{(i)vc} + Z_i^T b_{(i){vc}}\ \ \ \ (2.b)

    X = (X_1, ..., X_p)^T

    \beta_{vc} = (\beta_{vc1}, ..., \beta_{vcp})^T

    y_i = (y_{i1}, ..., y_{iK})^T

    x_i = (x_{i1}, ..., x_{ip})^T

    b_{vc} = (b_{vc1}, ..., b_{vcq})^T

    z_i = (z_{i1}, ..., z_{iq})^T

    \beta_v = (\beta_{vc}, c \in \mathcal C_v)

    \beta = (\beta_v, v \in \mathcal V)

    \beta_{vi} | b_{vi} = (\beta_{(i)vc} | b_{(i)vc}, c \in \mathcal C_v)

    \beta_i | b_{i} = (\beta_{(i)v}| b_{(i)v}, v \in \mathcal V)

    l_{DTM}(\beta) = \log L_{DTM}(\beta) = \sum_{v \in \mathcal V} \log L_v(\beta_v) \ \ \ \ (3.a)

l_v(\beta_v) = log L_v(\beta_v) = \sum_{i=1}^n \Bigg[\tilde \Gamma \Big(\sum_{c\in \mathcal C_v} \alpha_{ivc}\Big) - \tilde \Gamma \Big(\sum_{c\in \mathcal C_v} y_{ivc} + \sum_{c\in \mathcal C_v} \alpha_{ivc}\Big) +c\sum_{c \in \mathcal C_v} \bigg\{ \tilde \Gamma \Big(y_{ivc} + \alpha_{ivc}\Big) - \tilde \Gamma \Big(\alpha_{ivc}\Big) \bigg\}\Bigg]

\tilde \Gamma (.) = \log \big(\Gamma (.) \big)

\alpha_{ivc} = \exp(x_i^T\beta_{vc})

y_{ivc} = \sum_{l \in \mathcal L} \delta_{vc} (l) y_{il}

l_v(\beta_v) = \log L_v(\beta_v) = \sum_{i=1}^n \Bigg[\log \bigg(\Gamma \Big(\sum_{c\in \mathcal C_v} \exp(x_i^T\beta_{vc})\Big) \bigg) - \log \bigg(\Gamma \Big(\sum_{c\in \mathcal C_v} y_{ivc} + \sum_{c\in \mathcal C_v} \exp(x_i^T\beta_{vc})\Big) \bigg) +\sum_{c \in \mathcal C_v} \bigg\{ \log \bigg(\Gamma \Big(y_{ivc} + \exp(x_i^T\beta_{vc})\Big)\bigg) -\log \bigg(\Gamma \Big(\exp(x_i^T\beta_{vc})\Big) \bigg)\bigg\}\Bigg]

Regularized Likelihood Estimation

  • parametrization for Dirichlet Tree Multinomial Regression Model

  • the number of p time the number of tree branches \sum_{v\in \mathcal V} K_v

pnl_{DTM}(\beta; \lambda, \gamma) = -l_{DTM}(\beta) + \lambda \bigg\{ (1- \gamma)\sum_{v \in \mathcal V} \sum_{c \in \mathcal C_v}\|\beta_{cv}\|_{L1} + \gamma \sum_{v \in \mathcal V} \sum_{c \in \mathcal C_v}\|\beta_{cv}\|_{L2} \bigg\}

Algorithm of accelerated proximal gradient method

  • approximate l_{DTM} (\beta) at \beta^{(t)}

    - \tilde l_{DTM}(\beta; \eta^{(t)}) = -l_{DTM}(\eta^{(t)}) - \langle\beta - \eta^{(t)}, \nabla l_{DTM} (\eta^{(t)})\rangle + \frac {C} {2} \|\beta - \eta^{(t)}\|^2_2 \ \ \ \ (4)

    pn \tilde{l}_{DMT}(\beta) = - \tilde l_{DTM} (\beta; \eta^{(t)}) + \lambda_1 \sum_{v \in \mathcal V} \sum_{c \in \mathcal C_v}\|\beta_{cv}\|_{1} + \lambda_2\sum_{v \in \mathcal V} \sum_{c \in \mathcal C_v}\|\beta_{cv}\|_{2} \ \ \ \ (5)

    \beta^{(t+1)} = \underset {\beta} {arg min} \big(pn \tilde{l}_{DTM} (\beta)\big) \ \ \ \ (6)

    \eta^{(t+1)} = \beta^{(t+1)} + \frac {1-\alpha_t} {\alpha_t} \alpha_{t+1} (\beta^{(t+1)} - \beta^{(t)}) \ \ \ \ (7)

    \alpha_t = 2/(t+2)

    \beta_{vc}^{(t+1)} = \underset {\beta_{vc}} {argmin} \Bigg[ \frac 1 2 \bigg \|\beta_{vc} - \Big\{\eta_{vc}^{(t)} - \frac 1 C \nabla l_{vc} (\eta_v^{(t)}) \Big\}\bigg\|_2^2 + \frac {\lambda_1} {C}\|\beta_{vc}\|_1 + \frac {\lambda_2} {C} \|\beta_{vc}\|_2 \Bigg]

    \text {s.t.} \ \ \beta_{vc}^{(t+1)} = \underset {\beta_{vc}} {argmin} \Bigg[ \frac 1 2 \bigg \| \beta_{vc} +\frac 1 C \nabla l_{vc} (0) \bigg\|_2^2 + \lambda \bigg(\frac {1-\gamma} {C}\|\beta_{vc}\|_1 + \frac {\gamma} {C} \|\beta_{vc}\|_2\bigg) \Bigg]

    \tilde \beta_{vc}^{(t+1)} = sgn \bigg\{ \eta_{vc}^{(t)} - \frac 1 C \nabla l_{vc} (\eta_{vc}^{(t)})\bigg\} \ \max \bigg\{0, \Big|\eta_{vc}^{(t)} - \frac 1 C \nabla l_{vc}(\eta_v^{(t)}) \Big| - \frac {\lambda_1} {C}\bigg\}

    \beta_{vc}^{(t+1)} = \frac {\tilde \beta_{vc}^{(t+1)}}{\|\tilde \beta_{vc}^{(t+1)}\|_2} \max \bigg\{0, \Big\|\tilde \beta_{vc}^{(t+1)} \Big\|_2 - \frac {\lambda_1} {C}\bigg\} \ \ \ \ (8)

About

Longitudinal Dirichlet Tree Multinomial Model

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •