A joint finite mixture model for clustering genes from beta, Gaussian and Bernoulli distributed data
DAI, XIAOFENG (2009)
Bioinformatics
Faculty of Medicine
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2009-12-31
The permanent address of the publication is
https://urn.fi/urn:nbn:fi:uta-1-20284
Abstract
Background: Gene expression and protein-protein interaction data are often combined in gene clustering, an approach that has succeeded in applications such as pathway discovery and function inference. However, asynchronous relations, which can be measured by protein-DNA binding data, are also crucial in regulatory networks and should be taken into account. How to make efficient use of gene expression, protein-protein interaction and protein-DNA binding data together therefore poses a major challenge.
Method: A beta-Gaussian-Bernoulli mixture model (BGBMM) is proposed to address this problem, assuming that protein-DNA binding probabilities, gene expression data and protein-protein interactions can be modeled as beta, Gaussian and Bernoulli distributions, respectively. BGBMM is a natural extension of the beta mixture model, the Gaussian mixture model and the Bernoulli mixture model. It differs from other mixture-model-based methods by fusing three heterogeneous data types into a unified probabilistic modeling framework, with each data type modeled as one component. BGBMM is demonstrated to be an efficient model for data fusion, and it is applicable to any data sources that can be modeled as beta, Gaussian or Bernoulli distributed random variables. In principle, it can also be extended to data following other parametric distributions. A joint standard expectation maximization algorithm is developed to estimate the parameters of BGBMM. Four approximation-based model selection methods, namely the Akaike information criterion (AIC), a modified AIC, the Bayesian information criterion (BIC), and the integrated classification likelihood-Bayesian information criterion, are compared for BGBMM, and the best-performing ones are recommended for future use.
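As an illustration of the kind of computation a joint expectation maximization algorithm of this type involves, the following minimal sketch computes the E-step posterior cluster probabilities for one gene, together with the standard AIC and BIC scores. It is not the thesis implementation. It assumes, beyond what the abstract states, that the three data types are conditionally independent given the cluster and that each gene has a single feature per data type; the function names and parameter layout are hypothetical.

```python
import math


def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def gauss_logpdf(x, mu, sigma):
    """Log density of a Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def bernoulli_logpmf(x, p):
    """Log probability of a binary outcome x under success probability p."""
    return math.log(p) if x == 1 else math.log(1 - p)


def responsibilities(gene, weights, params):
    """E-step for one gene: posterior probability of each cluster.

    gene    = (binding_probability, expression, interaction)  # hypothetical layout
    weights = mixing proportions, one per cluster
    params  = per-cluster tuples (a, b, mu, sigma, p)

    Assumption: given the cluster, the three data types are independent,
    so their log densities simply add.
    """
    u, y, z = gene
    log_post = [
        math.log(w)
        + beta_logpdf(u, a, b)
        + gauss_logpdf(y, mu, sigma)
        + bernoulli_logpmf(z, p)
        for w, (a, b, mu, sigma, p) in zip(weights, params)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    total = sum(unnorm)
    return [v / total for v in unnorm]


def aic(log_likelihood, n_params):
    """Akaike information criterion (smaller is better)."""
    return -2 * log_likelihood + 2 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion (smaller is better)."""
    return -2 * log_likelihood + n_params * math.log(n_obs)


# Two illustrative clusters with made-up parameters.
example_params = [(5.0, 1.0, 2.0, 1.0, 0.9), (1.0, 5.0, -2.0, 1.0, 0.1)]
example_resp = responsibilities((0.9, 1.8, 1), [0.5, 0.5], example_params)
```

In a full EM loop these responsibilities would be computed for every gene and then used to re-estimate the mixing proportions and the per-cluster parameters in the M-step. The modified AIC and the integrated classification likelihood criterion are omitted here because the abstract does not define them.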
Results: Simulation tests and a real-case application show that combining three data sources into a single joint mixture model, with each data type modeled as one component, can substantially improve clustering accuracy. In addition, applying BGBMM to mouse data reveals genes involved in Toll-like receptor-stimulated macrophage activation.
Keywords: joint finite mixture model, gene clustering, gene expression, protein-DNA binding probability, protein-protein interaction