ChIPpeaksAnno
ChIPpeaksAnnot is a Bioconductor package, built on top of all the Biostrings, BSGenome, IRanges stuff, that will let you annotate peaks with nearby features. Some features (eg. mouse and human TSS positions for current genomes) are made available as data with the package, but it also speaks biomaRt to fetch other features if you need them. It’s pretty fast – it annotated 35K peaks with their nearest Ensembl gene in seconds.
Quick Example:
#load the mouse mm9 TSS positions
library(ChIPpeakAnno)
data(TSS.mouse.NCBIM37)
#or you could grab them from biomaRt if you want to:
library(biomaRt)
ensmart<-useMart("ensembl", dataset="mmusculus_genes_ensembl")
TSS.mouse.NCBIM37 <- getAnnotation(mart=ensmart, featureType="TSS")
#The annotation from ChIPpeakAnno has chromosome names like "1", "2", "X", so if you have
#"chr1", remove the "chr"
peaks$Chr = gsub('chr','',peaks$Chr)
#make your peak data into a RangedData object
peak.ranges<-RangedData(
ranges= IRanges(
start=peaks$GenomeStart ,
end=peaks$GenomeEnd ),
names = paste("H3K9me3_peak", seq(1:nrow(peaks)), sep=""),
space = peaks$Chr
)
#and annotate your peaks with the nearest TSS data:
annotatedPeak = annotatePeakInBatch(peak.ranges[1:5,], AnnotationData = TSS.mouse.NCBIM37)
as.data.frame(annotatedPeak)
# space start end width names strand feature
#1 1 108068801 108070959 2159 peak4 1 ENSMUSG00000044340
#2 1 176431793 176434091 2299 peak5 1 ENSMUSG00000028354
#3 1 180265735 180267970 2236 peak1 -1 ENSMUSG00000039630
#4 1 132611692 132615119 3428 peak2 -1 ENSMUSG00000026409
#5 1 122014653 122018516 3864 peak3 -1 ENSMUSG00000026385
# start_position end_position insideFeature distancetoFeature
#1 108068581 108290821 TRUE 220
#2 176431956 176752198 FALSE -163
#3 180258431 180267915 TRUE 2180
#4 132585759 132612391 TRUE 699
#5 122009867 122017594 TRUE 2941
The annotatePeakInBatch function throws a wobbly if your peaks don’t have rownames. Something like the following will fix it:
*** annotatePeakInBatch.R.old 2010-01-14 12:36:00.000000000 +0000
--- annotatePeakInBatch.R 2010-01-14 12:37:47.000000000 +0000
***************
*** 26,31 ****
--- 26,34 ----
"strand")
allChr.Anno = unique(space(TSS.ordered))
numberOfChromosome = length(unique(space(myPeakList)))
+ if(is.null(rownames(myPeakList))){
+ rownames(myPeakList)<-paste("peak", seq(1,nrow(myPeakList)),
sep="")
+ }
z1 = cbind(as.character(rownames(myPeakList)),
as.character(space(myPeakList)),
start(myPeakList), end(myPeakList))
colnames(z1) = c("name", "chr", "peakStart", "peakEnd")