Research

I work in computational biology and bioinformatics (CBB), a fascinating and rapidly growing field that uses computational methods to study biological and medical phenomena.

The primary reason CBB is needed is the huge amount of experimental data awaiting analysis and interpretation. For example,

  1. Each haploid set of the human genome contains about 3 billion base pairs. If we have 1,000 such genomes, how should we store the data in order to save storage space and facilitate efficient access? How can we quickly identify all the differences between these long sequences? Among these differences, how do we pinpoint those that are associated with diseases? How can we tell their statistical and biological significance?
  2. Within the human genome, there are 20,000-25,000 protein-coding genes. How can we predict their functions and interactions based on the many different types of data available? How can we tell if the predictions are correct? How can we use the predictions to advance biological and medical research?

As you can see, CBB research touches on many areas of computer science, including data mining, machine learning, database management and algorithm design. New computational problems keep emerging in CBB as new types of large-scale data appear and as real-world applications call for novel analysis tasks. As a result, there is an urgent need for new research to solve these challenging problems.

Here are some of my recent projects:

Whole-genome identification of sequence elements

The genomes of humans and many other organisms have been sequenced, yet understanding what the different parts of these DNA sequences do, and what roles they play in the overall biological systems, is still an ongoing endeavor. The ENCODE (Encyclopedia of DNA Elements) and modENCODE (ENCODE for model organisms) consortia have combined the efforts of many academic and research institutes worldwide with the aim of identifying and characterizing the functional elements in the genomes of human and model organisms such as fruit fly, mouse and worm. We have been analyzing data from these consortia and from other large-scale data sets. Specific projects include the identification of non-coding RNAs, transcription factor binding sites and enhancers, and the study of their functional roles.

Computational problems and techniques: machine learning (supervised, unsupervised and semi-supervised), feature selection, development of efficient algorithms for large datasets

Selected publications:

Computational modeling of gene regulation

Gene expression is a complicated process controlled by different functional components such as transcription factor binding, DNA methylation and histone modifications. Statistical modeling using high-throughput data can help investigate the relationships among these components at a global scale. We have been using the latest large-scale datasets to construct these models, with the goal of gaining new insights into gene regulation.

Computational problems and techniques: classification and regression, graph algorithms

Selected publications:

Study of genetic diseases

It is now common to perform high-throughput experiments to compare the DNA, gene expression and other molecular signatures between disease samples and normal controls. These experiments produce large amounts of data that need computational methods to process, store, transfer, analyze and archive. We are working with multiple groups of collaborators who study different kinds of genetic diseases.

Computational problems and techniques: string matching and searching, database management, statistical analysis

Selected publications:

Annotation of genetic variants

Many genetic variants associated with particular phenotypes such as diseases are in non-coding regions. We have been developing methods and tools to help explore the potential functional roles of these variants in association with the phenotypes.

Computational problems and techniques: data integration, database management, parallel computation, software engineering

Selected publications:

Reconstruction of biological networks

Biological objects do not work alone, but rather interact with other objects to form complex networks. Cataloging these interactions is a first step toward understanding the functions of individual objects and the biological systems as a whole. Specific interactions have been identified and studied thoroughly, but high-throughput techniques for probing whole networks are still catching up with the high data quality required for in-depth analyses. We have been using computational methods to integrate different types of data, with the goal of reconstructing biological networks that are not yet fully known with high precision and coverage.

Computational problems and techniques: graph learning, data integration, kernel methods, probabilistic modeling, time-series analysis

Selected publications:

Tool development

We have also developed some tools for the community to perform computational analysis. Our recent interest is developing tools for the efficient analysis of large-scale datasets, such as "signal tracks" produced by massively parallel sequencing.

Here are some of my older interests:

Prediction of functionally coupled objects through co-evolutionary analysis

Biological objects that are functionally coupled, such as protein domains that interact, are constrained against independent evolutionary changes that would disrupt their normal interactions. Instead, they may undergo co-evolution, in which the fitness loss caused by a change in one object is restored by a compensating change in the other. Taking this idea in the reverse direction, by looking for objects that display co-evolutionary patterns, we could discover functionally coupled objects and advance our understanding of the corresponding biological pathways.
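
One simple way to quantify a co-evolutionary pattern is the mutual information between two columns of a multiple sequence alignment. The sketch below is only an illustration of that general idea, not the specific method used in the work described here. The function name and the toy columns are made up for the example, and a real analysis would also need to correct for phylogeny and small sample sizes.

```python
from collections import Counter
from math import log2


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    Each column is a string with one residue per sequence, in the
    same sequence order for both columns.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must cover the same sequences")
    n = len(col_a)
    count_a = Counter(col_a)             # residue counts in column A
    count_b = Counter(col_b)             # residue counts in column B
    count_ab = Counter(zip(col_a, col_b))  # joint residue-pair counts
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts
        ratio = p_ab * n * n / (count_a[a] * count_b[b])
        mi += p_ab * log2(ratio)
    return mi


# Toy columns, invented for illustration only.
col_1 = "AAAALLLL"
col_2 = "DDDDKKKK"  # changes together with col_1
col_3 = "GGSSGGSS"  # varies independently of col_1

print(mutual_information(col_1, col_2))  # high: columns co-vary
print(mutual_information(col_1, col_3))  # zero: columns are independent
```

Column pairs with high mutual information are candidates for functional coupling; pairs with values near zero show no evidence of co-variation in this toy setting.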

Computational problems and techniques: statistical modeling, information theory, correlation analysis

Selected publications:

Mining useful information from data with uncertainty

Many types of data carry some uncertainty due to factors such as low measurement resolution, noise and staleness. The uncertainty could hinder or mislead the mining of useful information. Taking data uncertainty and related information such as repeated measurements into account in the mining process could help uncover hidden patterns. We have been designing algorithms to handle data uncertainty for a variety of data mining problems.
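
To make the role of repeated measurements concrete, here is a minimal sketch, assuming one-dimensional values and treating each repeated measurement as an equally likely value of the object. It computes the expected squared distance from an uncertain object to a fixed point, a quantity that distance-based mining methods could use in place of the distance to a single averaged value. The function name and the numbers are hypothetical and are not taken from the algorithms referred to above.

```python
from statistics import mean, pvariance


def expected_sq_distance(samples, center):
    """Expected squared distance from an uncertain 1-D value to `center`.

    `samples` holds the repeated measurements of one object, each
    treated as an equally likely value (an assumption of this sketch).
    """
    return sum((x - center) ** 2 for x in samples) / len(samples)


# Two hypothetical objects with the same mean but different spread.
precise = [2.0, 2.0]
noisy = [1.0, 3.0]
center = 0.0

print(expected_sq_distance(precise, center))
print(expected_sq_distance(noisy, center))

# The expected squared distance decomposes into the squared distance
# from the mean plus the variance of the measurements.
decomposed = (mean(noisy) - center) ** 2 + pvariance(noisy)
print(decomposed)
```

Using only the means would make the two objects indistinguishable; the expected distance separates them because it also reflects the spread of the measurements.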

Computational problems and techniques: clustering, classification, pattern mining, data structures, data pruning

Selected publications: