Introduction
ChIPCor is a testing procedure for evaluating the significance of the overlap between the binding sites of a pair of proteins, leveraging information from publicly available data.
By borrowing information from protein binding profiles in public data repositories, the method corrects for background artifacts in ChIP-seq data and for Simpson's paradox arising from genome heterogeneity.
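To see why genome heterogeneity matters, here is a small base R sketch with invented counts. It is only an illustration of the pooling artifact, not ChIPCor's test: the helper names `make_tbl` and `odds_ratio` and all numbers are assumptions. Two proteins bind independently within each segment type, yet pooling the segments suggests a strong association.

```r
# Build a 2x2 co-occurrence table: rows = protein A, columns = protein B.
make_tbl <- function(both, a_only, b_only, neither) {
  matrix(c(both, a_only, b_only, neither), nrow = 2, byrow = TRUE,
         dimnames = list(A = c("bound", "unbound"), B = c("bound", "unbound")))
}

# Odds ratio of a 2x2 table; a value of 1 means no association.
odds_ratio <- function(tbl) (tbl[1, 1] * tbl[2, 2]) / (tbl[1, 2] * tbl[2, 1])

# Invented counts: both proteins bind 50% of bins in "active" segments
# and 10% of bins in "inactive" segments, independently of each other.
active   <- make_tbl(250, 250, 250, 250)
inactive <- make_tbl(100, 900, 900, 8100)

# Ignoring segment type amounts to adding the tables together.
pooled <- active + inactive

odds_ratio(active)    # 1
odds_ratio(inactive)  # 1
odds_ratio(pooled)    # about 2.21
```

The pooled odds ratio is well above 1 only because both proteins prefer the same segment type, which is why the overlap is summarized separately for each segment type.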
Use ChIPCor
Download and installation
- Click the following links to download the package:
- Follow the standard procedure to install the package in R.
- ChIPCor depends on the Bioconductor package qvalue to convert p-values to q-values.
Prepare configuration file
The user needs to prepare an input list containing overlap information for both the target protein pair and the database of protein pairs from public data repositories under the same biological condition.
Each element contains the names of a given pair of proteins and a list of two-by-two tables representing their overlap in each segment type.
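As a rough picture of such an element, the following base R sketch builds one by hand. The field names `proteins` and `tables`, the protein names, the segment labels, and the counts are all hypothetical; the actual layout is whatever `prepareTbls` returns and can be inspected with `str`.

```r
# Shared row/column labels for every 2x2 table (rows = protein A, columns = protein B).
dn <- list(A = c("bound", "unbound"), B = c("bound", "unbound"))

# One element of the input list: a protein pair plus one 2x2 table per segment type.
# Field names and counts are hypothetical, chosen only to show the shape.
pair_entry <- list(
  proteins = c("ProteinA", "ProteinB"),
  tables = list(
    promoter = matrix(c(40, 10, 15, 935), nrow = 2, byrow = TRUE, dimnames = dn),
    enhancer = matrix(c(25, 30, 20, 425), nrow = 2, byrow = TRUE, dimnames = dn)
  )
)

# The full input holds one such element per protein pair.
input_list <- list(pair_entry)
str(input_list[[1]], max.level = 2)
```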
To prepare the list of two-by-two tables for protein pairs under a given genome segmentation, users can take the following steps:
- Segment the whole genome into bins of equal sizes (e.g. 1000 base pairs). For instance, given the bed file of ChromHMM segmentation for cell line K562, the genome can be segmented into 1000bp long bins:
# read in a vector recording the length of each chromosome for hg18
# (the call below expects this to load an object named chrlen)
data(hg18.chrlen)
# segment the genome into 1000 bp bins and
# annotate each bin according to its ChromHMM state
genomeseg <- genomeSeg("wgEncodeBroadHmmK562HMM.bed", winsize = 1000, chrlen)
- Similarly, for each protein, each bin can be annotated as 0 or 1 according to the absence or presence of its binding peaks using the function genomeSeg with the same segmentation length (e.g. 1000 base pairs).
- Once the peak lists for all the proteins have been converted to 0 or 1 vectors, we obtain a table where each row corresponds to a genomic bin and each column represents the genome-wide binding profile of a protein. The sample database table for K562 can be downloaded here.
- Then, the function prepareTbls constructs a two-by-two table summarizing the co-occurrence pattern of binding sites for each pair of proteins under each genome segment type. Each element of the output of prepareTbls corresponds to a protein pair and records the co-occurrence patterns of that pair of proteins over each segment type.
data(K562peaklocmat)
tbls=prepareTbls(peakloc.mat,genomeseg)
#data structure
str(tbls[[1]])
- Finally, the function ChIPCor can be applied to measure the spatial correlations of protein binding sites.
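The binning and tabulation steps above can be sketched in base R on toy data. This is not the package's `genomeSeg` or `prepareTbls` implementation: the helpers `bin_peaks` and `pair_tables`, the peak coordinates, and the segment labels are invented for illustration, and a single 10,000 bp chromosome is assumed.

```r
winsize <- 1000                       # bin width in base pairs
chr_len <- 10000                      # toy chromosome length (assumption)
n_bins  <- ceiling(chr_len / winsize)

# Mark a bin 1 if any peak overlaps it, 0 otherwise.
# Peaks use 1-based inclusive start/end coordinates within the chromosome.
bin_peaks <- function(peaks, n_bins, winsize) {
  hit <- integer(n_bins)
  for (i in seq_len(nrow(peaks))) {
    first <- (peaks$start[i] - 1) %/% winsize + 1
    last  <- min((peaks$end[i] - 1) %/% winsize + 1, n_bins)
    hit[first:last] <- 1L
  }
  hit
}

# Toy peak lists for two hypothetical proteins.
peaks_A <- data.frame(start = c(100, 2500, 7200), end = c(300, 3100, 7400))
peaks_B <- data.frame(start = c(150, 3500, 9100), end = c(250, 3600, 9300))

# Rows are genomic bins, columns are per-protein binding profiles.
toy_peakloc <- cbind(A = bin_peaks(peaks_A, n_bins, winsize),
                     B = bin_peaks(peaks_B, n_bins, winsize))

# Toy segment annotation: one label per bin.
toy_seg <- rep(c("active", "quiescent"), each = 5)

# One 2x2 co-occurrence table per segment type for a pair of profiles.
pair_tables <- function(x, y, seg) {
  lapply(split(seq_along(seg), seg), function(idx) {
    table(factor(x[idx], levels = c(1, 0)),
          factor(y[idx], levels = c(1, 0)),
          dnn = c("A", "B"))
  })
}

toy_tbls <- pair_tables(toy_peakloc[, "A"], toy_peakloc[, "B"], toy_seg)
toy_tbls$active
toy_tbls$quiescent
```

Fixing the factor levels to `c(1, 0)` keeps every table two-by-two even when a segment type contains no bound bins for one of the proteins.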
Examples
Using ChromHMM to segment the genome, we applied our testing procedure to transcription factors assayed in the ENCODE project cell line K562:
data(K562tbls)
#inspection of the data structure
str(tbls[[1]])
#Conduct testing
qval.K562<-ChIPCor(tbls)
head(qval.K562)
and GM12878:
data(GM12878tbls)
qval.GM12878<-ChIPCor(tbls)
head(qval.GM12878)