Research
I work in the area of computational biology and bioinformatics (CBB). It is a fascinating, rapidly growing field of research that aims to use computational methods to study biological and medical phenomena. The primary motivation for CBB is the huge amount of experimental data awaiting analysis and interpretation. For example,
As you can see, CBB research touches on many areas of computer science, including data mining, machine learning, database management and algorithm design. Many computational problems in CBB keep evolving, driven by the emergence of new types of large-scale data and the need to devise novel analysis tasks for various real-world applications. As a result, there is an urgent need for new research to solve these challenging problems.
Here are some of my recent projects:
Whole-genome identification of sequence elements
The genomes of humans and many other organisms have been sequenced, yet understanding what the different parts of these DNA sequences do and what roles they play in the overall biological systems is still an ongoing endeavor. The ENCODE (Encyclopedia of DNA Elements) and modENCODE (ENCODE for model organisms) consortia have combined the efforts of many academic and research institutes worldwide with the aim of identifying and characterizing the functional elements in the genomes of human and model organisms such as fruit fly, mouse and worm. We have been analyzing data from these and other large-scale data sets. Specific projects include the identification of non-coding RNAs, transcription factor binding sites and enhancers, and the study of their functional roles.
Computational problems and techniques: machine learning (supervised, unsupervised and semi-supervised), feature selection, development of efficient algorithms for large datasets
Selected publications:
Computational modeling of gene regulation
Gene expression is a complicated process controlled by different functional components such as transcription factor binding, DNA methylation and histone modifications. Statistical modeling using high-throughput data can help investigate the relationships among these components at a global scale. We have been using the latest large-scale datasets to construct these models, with the goal of gaining new insights into gene regulation.
Computational problems and techniques: classification and regression, graph algorithms
Selected publications:
Study of genetic diseases
It is now common to perform high-throughput experiments to compare the DNA, gene expression and other molecular signatures between disease samples and normal controls. These experiments produce large amounts of data that require computational methods to process, store, transfer, analyze and archive. We are working with multiple groups of collaborators who study different kinds of genetic diseases.
Computational problems and techniques: string matching and searching, database management, statistical analysis
Selected publications:
Annotation of genetic variants
Many genetic variants associated with particular phenotypes such as diseases are in non-coding regions. We have been developing methods and tools to help explore the potential functional roles of these variants in association with the phenotypes.
Computational problems and techniques: data integration, database management, parallel computation, software engineering
Selected publications:
Reconstruction of biological networks
Biological objects do not work alone, but rather interact with other objects to form complex networks. Cataloging these interactions is a first step toward understanding the functions of individual objects and of the biological systems as a whole. Specific interactions have been identified and studied thoroughly, but high-throughput techniques for probing whole networks are still catching up with the high data quality required for in-depth analyses. We have been using computational methods to integrate different types of data, with the goal of reconstructing the not yet fully known biological networks with high precision and coverage.
Computational problems and techniques: graph learning, data integration, kernel methods, probabilistic modeling, time-series analysis
Selected publications:
Tool development
We have also developed some tools for the community to perform computational analysis. Our recent interest is developing tools for the efficient analysis of large-scale datasets, such as "signal tracks" produced by massively parallel sequencing.
Fok and Mok et al., ECplot: An Online Tool for Making Standardized Plots from Large Datasets for Bioinformatics Publications. Bioinformatics 30(10):1467-1468, (2014).
ACT: For aggregation, correlation and saturation of signal tracks from genomic experiments. Jee, Rozowsky and Yip et al., ACT: Aggregation and Correlation Toolbox for Analyses of Genome Tracks. Bioinformatics 27(8):1152-1154, (2011).
Coevolution: Studying the coevolution signal between residues of a protein. Yip and Patel et al., An Integrated System for Studying Residue Coevolution in Proteins. Bioinformatics 24(2):290-292, (2008).
: Analyzing network topology. Yip et al., The tYNA Platform for Comparative Interactomics: A Web Tool for Managing, Comparing and Mining Multiple Networks. Bioinformatics 22(23):2968-2970, (2006).
Here are some of my older interests:
Prediction of functionally coupled objects through co-evolutionary analysis
Biological objects that are functionally coupled, such as protein domains that interact, are constrained against independent evolutionary events that would disrupt their normal interactions. Instead, they may undergo co-evolution, in which the fitness loss due to an evolutionary event in one object is restored by a compensating event in the other. Taking this idea in the reverse direction, by looking for objects that display co-evolutionary patterns, we could discover functionally coupled objects and advance our understanding of the corresponding biological pathways.
Computational problems and techniques: statistical modeling, information theory, correlation analysis
Selected publications:
Mining useful information from data with uncertainty
Many types of data contain uncertainty due to factors such as low measurement resolution, noise and staleness. This uncertainty can hinder or mislead the mining of useful information. Taking data uncertainty and related information, such as repeated measurements, into account in the mining process can help uncover hidden patterns. We have been designing algorithms to handle data uncertainty for a variety of data mining problems.
Computational problems and techniques: clustering, classification, pattern mining, data structures, data pruning
Selected publications: