Home of Kevin Yip

Research

I work in the area of computational biology and bioinformatics (CBB). It is a fascinating, rapidly growing field of research that uses computational methods to study biological and medical phenomena.

The first reason CBB is needed is the huge amount of experimental data awaiting analysis and interpretation. For example,

Each haploid set of the human genome contains about 3 billion base pairs. If we have 1,000 such genomes, how should we store the data in order to save storage space and facilitate efficient access? How can we quickly identify all the differences between these long sequences? Among these differences, how do we pinpoint those that are associated with diseases? How can we tell their statistical and biological significance?
Within the human genome, there are 20,000-25,000 protein coding genes. How can we predict their functions and interactions based on the many different types of data available? How can we tell if the predictions are correct? How can we use the predictions to advance biological and medical research?
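
To give a feel for the storage question above, here is a rough sketch in Python. It assumes, purely for illustration, that every base is one of A, C, G or T and can therefore be packed into 2 bits; ambiguous bases, quality scores, metadata and compression against a reference genome are all ignored. The function names are hypothetical and do not refer to any existing tool.

```python
# Illustrative assumption: only A, C, G and T occur, so 2 bits per base suffice.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed byte string."""
    return "".join(
        BITS_TO_BASE[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


def packed_size_bytes(num_bases: int) -> int:
    """Number of bytes needed to store `num_bases` bases at 2 bits each."""
    return (num_bases + 3) // 4


if __name__ == "__main__":
    one_genome = packed_size_bytes(3_000_000_000)  # about 3 billion base pairs
    print("one haploid genome:", one_genome / 1e6, "MB")
    print("1,000 genomes:", 1000 * one_genome / 1e9, "GB")
```

Under these assumptions one haploid genome takes about 750 MB and 1,000 genomes about 750 GB before any compression. Since most of the sequence is shared between individuals, storing only the differences from a reference is an obvious next idea, which is exactly where the questions about identifying differences come in.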

As you can see, CBB research touches on many areas of computer science, including data mining, machine learning, database management and algorithm design. New types of large-scale data keep emerging, and real-world applications keep calling for novel analysis tasks, so the computational problems in CBB are constantly evolving. As a result, there is an urgent need for new research to solve these challenging problems.

Here are some of my recent projects:

Whole-genome identification of sequence elements

The genomes of humans and many other organisms have been sequenced, yet understanding what the different parts of these DNA sequences do and their roles in the overall biological systems is still an ongoing endeavor. The ENCODE (Encyclopedia of DNA Elements) and modENCODE (ENCODE for model organisms) consortia have combined the efforts of many academic and research institutes worldwide with the aim of identifying and characterizing the functional elements in the genomes of humans and model organisms such as fruit fly, mouse and worm. We have been analyzing data from these and other large-scale data sets. Specific projects include the identification of non-coding RNAs, transcription factor binding sites and enhancers, and the study of their functional roles.

Computational problems and techniques: machine learning (supervised, unsupervised and semi-supervised), feature selection, development of efficient algorithms for large datasets

Selected publications:

Hu et al., A Common Set of Distinct Features that Characterize Noncoding RNAs across Multiple Species. Nucleic Acids Research 43(1):104-114, (2015).
Yip et al., Machine Learning and Genome Annotation: A Match Meant to be? Genome Biology 14(5):205, (2013).
The ENCODE Project Consortium, An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature 489(7414):57-74, (2012).
Yip et al., Classification of Human Genomic Regions based on Experimentally-determined Binding Sites of More Than 100 Transcription-related Factors. Genome Biology 13(9):R48, (2012).
Lu and Yip et al., Prediction and Characterization of Non-coding RNAs in C. elegans by Integrating Conservation, Secondary Structure and High Throughput Sequencing and Array Data. Genome Research 21(2):276-285, (2011).
The modENCODE consortium, Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project. Science 330(6012):1775-1787, (2010).

Computational modeling of gene regulation

Gene expression is a complicated process controlled by different functional components such as transcription factor binding, DNA methylation and histone modifications. Statistical modeling using high-throughput data can help investigate the relationships among these components at a global scale. We have been using the latest large-scale datasets to construct these models with the goal of gaining new insights into gene regulation.

Computational problems and techniques: classification and regression, graph algorithms

Selected publications:

Lou et al., Whole-Genome Bisulfite Sequencing of Multiple Individuals Reveals Complementary Roles of Promoter and Gene Body Methylation in Transcriptional Regulation. Genome Biology 15(7):408, (2014).
Cheng et al., Understanding Transcriptional Regulation by Integrative Analysis of Transcription Factor Binding Data. Genome Research 22(9):1658-1667, (2012).
Cheng et al., Genome-wide Analysis of Chromatin Features Identifies Chromatin-Sensitive and Chromatin-Insensitive Classes of Yeast Transcription Factors. Genome Biology 12(11):R111, (2011).
Cheng et al., A Statistical Framework for Modeling Gene Expression using Chromatin Features with Application to modENCODE Datasets. Genome Biology 12(2):R15, (2011).

Study of genetic diseases

It is now common to perform high-throughput experiments to compare the DNA, gene expression and other molecular signatures between disease samples and normal controls. These experiments produce large amounts of data that need computational methods to process, store, transfer, analyze and archive. We are working with multiple groups of collaborators who study different kinds of genetic diseases.

Computational problems and techniques: string matching and searching, database management, statistical analysis

Selected publications:

Tso et al., Are Special Read Alignment Strategies Necessary and Cost-Effective when Handling Sequencing Reads from Patient-Derived Tumor Xenografts? BMC Genomics 15:1172, (2014).
Chung et al., Identification of a Recurrent Transforming UBR5-ZNF423 Fusion Gene in EBV-Associated Nasopharyngeal Carcinoma. Journal of Pathology 231(2):158-167, (2013).
Tso et al., Complete Genomic Sequence of Epstein-Barr Virus in Nasopharyngeal Carcinoma Cell Line C666-1. Infectious Agents and Cancer 8(1):29, (2013).
Zhao, Sui et al., Sustained Antidiabetic Effects of a Berberine-containing Chinese Herbal Medicine through Regulation of Hepatic Gene Expression. Diabetes 61(4):933-943, (2012).

Annotation of genetic variants

Many genetic variants associated with particular phenotypes such as diseases are in non-coding regions. We have been developing methods and tools to help explore the potential functional roles of these variants in association with the phenotypes.

Computational problems and techniques: data integration, database management, parallel computation, software engineering

Selected publications:

Fu et al., FunSeq2: A Framework for Prioritizing Noncoding Regulatory Variants in Cancer. Genome Biology 15(10):480, (2014).
Ho et al., VAS: A Convenient Web Portal for Efficient Integration of Genomic Features with Millions of Genetic Variants. BMC Genomics 15:886, (2014).

Reconstruction of biological networks

Biological objects do not work alone, but rather interact with other objects to form complex networks. Cataloging these interactions is a first step toward understanding the functions of individual objects and of the biological systems as a whole. Specific interactions have been identified and studied thoroughly, but high-throughput techniques for probing whole networks are still catching up with the high data quality required for in-depth analyses. We have been using computational methods to integrate different types of data with the goal of reconstructing biological networks that are not yet fully known, with high precision and coverage.

Computational problems and techniques: graph learning, data integration, kernel methods, probabilistic modeling, time-series analysis

Selected publications:

Hu et al., Computational Identification of Protein Binding Sites on RNAs using High-Throughput RNA Structure-Probing Data. Bioinformatics 30(8):1049-1055, (2014).
Yip and Yip, Systematic Exploration of Autonomous Modules in Noisy MicroRNA-Target Networks for Testing the Generality of the ceRNA Hypothesis. BMC Genomics 15:1178, (2014).
Gerstein, Kundaje, Hariharan, Landt, Yan, Cheng, Mu, Khurana, Rozowsky, Alexander, Min, Alves et al., Architecture of the Human Regulatory Network Derived from ENCODE Data. Nature 489(7414):91-100, (2012).
Li et al., Extensive in vivo Metabolite-Protein Interactions Revealed by Large-Scale Systematic Analyses. Cell 143(4):639-650, (2010).
Yip et al., Improved Reconstruction of in silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data. PLoS ONE 5(1):e8121, (2010).
Yip et al., Training Set Expansion: An Approach to Improving the Reconstruction of Biological Networks from Limited and Uneven Reliable Interactions. Bioinformatics 25(2):243-250, (2009).

Tool development

We have also developed some tools for the community to perform computational analysis. Our recent interest is developing tools for the efficient analysis of large-scale datasets, such as "signal tracks" produced by massively parallel sequencing.

ECplot: For making reproducible plots from large data sets.
Fok and Mok et al., ECplot: An Online Tool for Making Standardized Plots from Large Datasets for Bioinformatics Publications. Bioinformatics 30(10):1467-1468, (2014).
ACT: For aggregation, correlation and saturation of signal tracks from genomic experiments.
Jee, Rozowsky and Yip et al., ACT: Aggregation and Correlation Toolbox for Analyses of Genome Tracks. Bioinformatics 27(8):1152-1154, (2011).
Coevolution: Studying the coevolution signal between residues of a protein.
Yip and Patel et al., An Integrated System for Studying Residue Coevolution in Proteins. Bioinformatics 24(2):290-292, (2008).
tYNA: Analyzing network topology.
Yip et al., The tYNA Platform for Comparative Interactomics: A Web Tool for Managing, Comparing and Mining Multiple Networks. Bioinformatics 22(23):2968-2970, (2006).

Here are some of my older interests:

Prediction of functionally coupled objects through co-evolutionary analysis

Biological objects that are functionally coupled, such as protein domains that interact, are constrained against independent evolutionary events that would disrupt their normal interactions. Instead, they may undergo co-evolution, in which the fitness loss due to one evolutionary event is restored by another event. Taking this idea in the reverse direction, by looking for objects that display co-evolutionary patterns, we could discover functionally coupled objects and advance our understanding of the corresponding biological pathways.

Computational problems and techniques: statistical modeling, information theory, correlation analysis
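
As a small illustration of how information theory can be used to look for co-evolutionary patterns, the sketch below computes the mutual information between two columns of a multiple sequence alignment. This is a generic textbook measure chosen for illustration, not necessarily the method used in the publications listed here, and the toy alignment and function name are made up.

```python
import math
from collections import Counter


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two equal-length alignment columns."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same length")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Hypothetical toy alignment: each string is one aligned sequence.
    alignment = ["ALK", "ALR", "VIK", "VIR"]
    columns = list(zip(*alignment))
    # Columns 0 and 1 always change together; column 2 varies independently.
    print(mutual_information(columns[0], columns[1]))
    print(mutual_information(columns[0], columns[2]))
```

Column pairs that change together score high, while pairs that vary independently score close to zero. In practice such raw scores would also need correcting for shared ancestry among the sequences and for small sample sizes, which this sketch does not attempt.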

Selected publications:

Chen et al., Identification of a Major Determinant for Serine-Threonine Kinase Phosphoacceptor Specificity. Molecular Cell 53(1):140-147, (2014).
Yip and Utz et al., Identification of Specificity Determining Residues in Peptide Recognition Domains using an Information Theoretic Approach Applied to Large-Scale Binding Maps. BMC Biology 9:53, (2011).
Yip and Patel et al., An Integrated System for Studying Residue Coevolution in Proteins. Bioinformatics 24(2):290-292, (2008).

Mining useful information from data with uncertainty

Many types of data contain some degree of uncertainty due to factors such as low measurement resolution, noise and staleness. The uncertainty could hinder or mislead the mining of useful information. Taking data uncertainty and related information such as repeated measurements into account in the mining process could help uncover hidden patterns. We have been designing algorithms to handle data uncertainty for a variety of data mining problems.

Computational problems and techniques: clustering, classification, pattern mining, data structures, data pruning

Selected publications:

Yip et al., Mining Order-Preserving Submatrices from Data with Repeated Measurements. IEEE Transactions on Knowledge and Data Engineering (TKDE) 25(7):1587-1600, (2013).
Ngai et al., Metric and Trigonometric Pruning for Clustering of Uncertain Data in 2D Geometric Space. Information Systems 36(2):476-497, (2011).
Tsang et al., Decision Trees for Uncertain Data. IEEE Transactions on Knowledge and Data Engineering (TKDE) 23(1):64-78, (2011).