Research Projects
Generating semi-structured gene summaries from biomedical literature
We developed Gene Summarizer, the first system that automatically generates gene summaries from
biomedical literature, to enable biomedical researchers to quickly digest
the literature. Gene Summarizer is also integrated with the
BeeSpace informatics search and
navigation systems, and can interactively summarize any collection.
Computational comparative genomics
We focused on efficiently identifying functionally coupled sequence
clusters in comparative genomics. The spatial clustering of genes across
different genomes has been used to study important problems in comparative
genomics, from identification of operons to detection of homologous regions.
We formalized the problem as computing max-gap clusters, a problem that is
novel and differs substantially from existing pattern discovery problems
in data mining. We developed a pattern generation and verification approach
(MCPaGe) for pair-wise
genome comparison. We then extended the max-gap cluster model to
multiple-genome comparison and developed another efficient algorithm (MCMuSeC)
that applies signature-based filtering.
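To make the max-gap idea concrete, here is a minimal sketch for a single genome. It is not MCPaGe or MCMuSeC. It assumes the commonly used notion that genes of interest form a max-gap cluster when adjacent members lie within a fixed gap of each other, and it measures the gap as a difference in integer positions. The function name `max_gap_runs` and the parameters `max_gap` and `min_size` are hypothetical. Comparing genomes would additionally require the same gene set to satisfy the condition in every genome, which this sketch does not attempt.

```python
from typing import List


def max_gap_runs(positions: List[int], max_gap: int, min_size: int = 2) -> List[List[int]]:
    """Split gene positions into maximal runs whose adjacent gaps are <= max_gap.

    Assumption: positions are integer coordinates of the genes of interest
    on one genome, and the gap is the difference between adjacent positions.
    """
    runs: List[List[int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        # Close the current run when the gap to the previous gene is too large.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_size:
                runs.append(current)
            current = []
        current.append(pos)
    # Flush the last run if it is large enough.
    if len(current) >= min_size:
        runs.append(current)
    return runs


if __name__ == "__main__":
    # Toy positions, invented for illustration only.
    print(max_gap_runs([1, 2, 4, 9, 10, 20], max_gap=2))  # [[1, 2, 4], [9, 10]]
```

The isolated gene at position 20 is dropped because a run of one gene does not meet `min_size`.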
Genome-wide prediction of cis-regulatory DNA motifs and modules
The activity of genes in a cell is regulated by other genes, and
complex networks of such regulatory interactions orchestrate a precise
pattern of expression of genes. We built and applied probabilistic models
and algorithms that integrate the heterogeneous and noisy sources of
genomics data, in order to deduce regulatory interactions of genes. We
performed genome-wide studies of cis-regulatory DNA motifs and modules in
diverse organisms, from yeast to fruit fly and honeybee. Ongoing
work includes developing probabilistic models and algorithms for cis-regulatory
module prediction by integrating heterogeneous data such as DNA sequence,
microarray data, and biomedical literature.
Summarizing facets and opinions in weblogs
We defined the problem of topic-sentiment analysis on Weblogs and
proposed a novel probabilistic model to capture the mixture of topics and
sentiments simultaneously. The proposed Topic-Sentiment Mixture (TSM) model
can reveal the latent topical facets in a Weblog collection, the subtopics
in the results of an ad hoc query, and their associated sentiments and
dynamics. This model can also be applied to any text collection with a
mixture of topics and sentiments, and thus has many potential applications, such
as search result summarization, opinion tracking, and user behavior
prediction.
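As a rough illustration of what a mixture of topics and sentiments means, the sketch below scores a word under a weighted combination of a neutral topic word distribution and positive and negative sentiment word distributions. It is a generic mixture, not the TSM model itself: the function name `mixture_word_prob`, the weights, and the toy distributions are all assumptions made up for this example, and no parameter estimation is shown.

```python
from typing import Dict, Tuple


def mixture_word_prob(
    word: str,
    topic_lm: Dict[str, float],
    pos_lm: Dict[str, float],
    neg_lm: Dict[str, float],
    weights: Tuple[float, float, float],
) -> float:
    """Probability of a word under a three-component mixture.

    weights = (neutral topic, positive sentiment, negative sentiment)
    and is assumed to sum to 1. Words missing from a distribution get 0.
    """
    neutral_w, pos_w, neg_w = weights
    return (
        neutral_w * topic_lm.get(word, 0.0)
        + pos_w * pos_lm.get(word, 0.0)
        + neg_w * neg_lm.get(word, 0.0)
    )


if __name__ == "__main__":
    # Toy distributions, invented for illustration only.
    topic = {"battery": 0.5, "good": 0.1}
    positive = {"good": 0.6}
    negative = {"bad": 0.7}
    print(mixture_word_prob("good", topic, positive, negative, (0.5, 0.3, 0.2)))
```

A word such as "good" receives probability mass from both the topic component and the positive component, which is what allows a mixture model to separate topical content from sentiment.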