-
Notifications
You must be signed in to change notification settings - Fork 0
Genomicfeaturesandranges
This page will show what you need to do to create genomic features databases and genomic ranges.
One purpose is to find overlaps between the cufflink transcripts created by Laura Saba and pre-existing transcripts defined by ensembl etc.
- biomaRt
- GenomicFeatures
- GenomicRanges
- rtracklayer
I have found the easiest way is, within R, to issue statements like this:
source("http://bioconductor.org/biocLite.R")
biocLite(GenomicFeatures)
If I use the R package installer, I have found that the installation sometimes hangs.
The genomic features module allows us to create sqlite transcript databases from Biomart, UCSC or user provided data. From these transcript databases, we can find overlapping transcripts easily.
To see available genomes at UCSC, you require rtracklayer.
ucscGenomes()[,"db"] [1] "hg19" "hg18" "hg17" "hg16" "vicPac1" "dasNov3" "otoGar3" "papHam1" "felCat5" "felCat4" "felCat3" [12] "panTro4" "panTro3" "panTro2" "panTro1" "bosTau7" "bosTau6" "bosTau4" "bosTau3" "bosTau2" "canFam3" "canFam2" [23] "canFam1" "turTru2" "loxAfr3" "nomLeu2" "nomLeu1" "gorGor3" "cavPor3" "eriEur1" "equCab2" "equCab1" "dipOrd1" [34] "triMan1" "calJac3" "calJac1" "pteVam1" "myoLuc2" "mm10" "mm9" "mm8" "mm7" "micMur1" "hetGla2" [45] "hetGla1" "monDom5" "monDom4" "monDom1" "ponAbe2" "ailMel1" "susScr3" "susScr2" "ochPri2" "ornAna1" "oryCun2" [56] "rn5" "rn4" "rn3" "rheMac3" "rheMac2" "proCap1" "oviAri1" "sorAra1" "choHof1" "speTri2" "saiBol1" [67] "sorAra1" "sarHar1" "echTel1" "tupBel1" "macEug2" "gadMor1" "melUnd1" "galGal4" "galGal3" "galGal2" "latCha1" [78] "fr3" "fr2" "fr1" "petMar2" "petMar1" "anoCar2" "anoCar1" "oryLat2" "geoFor1" "oreNil2" "chrPic1" [89] "gasAcu1" "tetNig2" "tetNig1" "melGal1" "xenTro3" "xenTro2" "xenTro1" "taeGut1" "danRer7" "danRer6" "danRer5" [100] "danRer4" "danRer3" "ci2" "ci1" "braFlo1" "strPur2" "strPur1" "apiMel2" "apiMel1" "anoGam1" "droAna2" [111] "droAna1" "droEre1" "droGri1" "dm3" "dm2" "dm1" "droMoj2" "droMoj1" "droPer1" "dp3" "dp2" [122] "droSec1" "droSim1" "droVir2" "droVir1" "droYak2" "droYak1" "caePb2" "caePb1" "cb3" "cb1" "ce10" [133] "ce6" "ce4" "ce2" "caeJap1" "caeRem3" "caeRem2" "priPac1" "aplCal1" "sacCer3" "sacCer2" "sacCer1"
To see all the possible tables at UCSC, do this:
> supportedUCSCtables()
track subtrack
knownGene UCSC Genes <NA>
knownGeneOld3 Old UCSC Genes <NA>
wgEncodeGencodeManualV3 Gencode Genes Genecode Manual
wgEncodeGencodeAutoV3 Gencode Genes Genecode Auto
wgEncodeGencodePolyaV3 Gencode Genes Genecode PolyA
ccdsGene CCDS <NA>
refGene RefSeq Genes <NA>
xenoRefGene Other RefSeq <NA>
vegaGene Vega Genes Vega Protein Genes
vegaPseudoGene Vega Genes Vega Pseudogenes
ensGene Ensembl Genes <NA>
acembly AceView Genes <NA>
sibGene SIB Genes <NA>
nscanPasaGene N-SCAN N-SCAN PASA-EST
nscanGene N-SCAN N-SCAN
sgdGene SGD Genes <NA>
sgpGene SGP Genes <NA>
geneid Geneid Genes <NA>
genscan Genscan Genes <NA>
exoniphy Exoniphy <NA>
augustusHints Augustus Augustus Hints
augustusXRA Augustus Augustus De Novo
augustusAbinitio Augustus Augustus Ab Initio
acescan ACEScan <NA>
lincRNAsTranscripts lincRNAsTranscripts <NA>
Not all tables are available for all genomes.
rn5RefGeneUCSC <- makeTranscriptDbFromUCSC(genome="rn5",tablename="refGene") Download the refGene table ... OK Download the refLink table ... OK Extract the 'transcripts' data frame ... OK Extract the 'splicings' data frame ... OK Download and preprocess the 'chrominfo' data frame ... OK Prepare the 'metadata' data frame ... metadata: OK Make the TranscriptDb object ... OK There were 50 or more warnings (use warnings() to see the first 50)
The errors seem to have to do with CDS:
> warnings() Warning messages: 1: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... : UCSC data anomaly in transcript NM_031007: the cds cumulative length is not a multiple of 3 2: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... : UCSC data anomaly in transcript NM_032071: the cds cumulative length is not a multiple of 3
Here you can see information about chromosomes:
seqinfo(rn5RefGeneUCSC) Seqinfo of length 2739 seqnames seqlengths isCircular genome chr1 290094216 FALSE rn5 chr2 285068071 FALSE rn5 chr3 183740530 FALSE rn5 chr4 248343840 FALSE rn5 chr5 177180328 FALSE rn5 chr6 156897508 FALSE rn5 chr7 143501887 FALSE rn5 chr8 132457389 FALSE rn5 chr9 121549591 FALSE rn5 ... ... ... ... chrUn_JH620690 1647 FALSE rn5 chrUn_JH620691 1172 FALSE rn5 chrUn_JH620692 2686 FALSE rn5 chrUn_JH620693 9171 FALSE rn5 chrUn_JH620694 6347 FALSE rn5 chrUn_JH620695 1669 FALSE rn5 chrUn_JH620696 7236 FALSE rn5 chrUn_JH620697 3488 FALSE rn5 chrUn_JH620698 3129 FALSE rn5
To get the latest version of ensemble, you need to point the host to www.ensembl.org or uswest.ensembl.org.
You can see which marts are available by using listMarts from the biomaRt package.
listMarts(host="uswest.ensembl.org")
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 70
2 ENSEMBL_MART_SNP Ensembl Variation 70
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 70
4 ENSEMBL_MART_VEGA Vega 50
5 REACTOME REACTOME (CSHL US)
6 pride PRIDE (EBI UK)
Then you can list the available datasets:
mart<-useMart("ENSEMBL_MART_ENSEMBL",host="uswest.ensembl.org")
listDatasets(mart=mart)
Among the results you will see:
12 rnorvegicus_gene_ensembl Rattus norvegicus genes (Rnor_5.0) Rnor_5.0
Finally, create the transcript database:
bioMartEnsemblRn5 <- makeTranscriptDbFromBiomart(
biomart="ENSEMBL_MART_ENSEMBL",
dataset='rnorvegicus_gene_ensembl',
host="uswest.ensembl.org")
You can now see many properties of the transcript database:
> bioMartEnsemblRn5 TranscriptDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: BioMart | Genus and Species: Rattus norvegicus | Resource URL: uswest.ensembl.org:80 | BioMart database: ENSEMBL_MART_ENSEMBL | BioMart database version: Ensembl Genes 70 | BioMart dataset: rnorvegicus_gene_ensembl | BioMart dataset description: Rattus norvegicus genes (Rnor_5.0) | BioMart dataset version: Rnor_5.0 | Full dataset: yes | miRBase build ID: NA | transcript_nrow: 29189 | exon_nrow: 216691 | cds_nrow: 208375 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2013-01-21 10:34:48 -0700 (Mon, 21 Jan 2013) | GenomicFeatures version at creation time: 1.10.1 | RSQLite version at creation time: 0.11.2 | DBSCHEMAVERSION: 1.0
Here is information about the chromosomes:
> seqinfo(bioMartEnsemblRn5) Seqinfo of length 95 seqnames seqlengths isCircular genome 1 <NA> <NA> <NA> 2 <NA> <NA> <NA> 3 <NA> <NA> <NA> 4 <NA> <NA> <NA> 5 <NA> <NA> <NA> 6 <NA> <NA> <NA> 7 <NA> <NA> <NA> 8 <NA> <NA> <NA> 9 <NA> <NA> <NA> ... ... ... ... JH620497.1 <NA> <NA> <NA> JH620498.1 <NA> <NA> <NA> JH620499.1 <NA> <NA> <NA> JH620501.1 <NA> <NA> <NA> JH620511.1 <NA> <NA> <NA> JH620517.1 <NA> <NA> <NA> JH620530.1 <NA> <NA> <NA> JH620566.1 <NA> <NA> <NA> JH620632.1 <NA> <NA> <NA>
Define the sqlite file name:
bioSqliteFile <- '/Users/walkerhound/Documents/TranscriptDatabases/bioMartEnsemblRn5.sqlite' rn5RefGeneUCSCFile <- '/Users/walkerhound/Documents/TranscriptDatabases/rn5RefGeneUCSC.sqlite'
Now save the database. Use saveDb from AnnotationDb instead of saveFeatures from GenomicFeatures because saveFeatures has been deprecated.
saveDb(bioMartEnsemblRn5,bioSqliteFile) saveDb(rn5RefGeneUCSC,rn5RefGeneUCSCFile)
This creates a genomic Range:
bioMartTranscripts <- transcripts(bioMartEnsemblRn5)
Here is an entry from the preceding genomic range:
bioMartTranscripts[10000:10000]
GRanges with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] 15 [62345358, 62348598] + | 23376 ENSRNOT00000001360
---
seqlengths:
1 2 3 4 ... JH620517.1 JH620530.1 JH620566.1 JH620632.1
NA NA NA NA ... NA NA NA NA
This creates a genomic Range List:
bioMartExonsByTranscript <- exonsBy(bioMartEnsemblRn5,by="tx")
Here is an entry from the preceding genomic range list:
bioMartExonsByTranscript[15000:15000]
GRangesList of length 1:
$15000
GRanges with 7 ranges and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
<Rle> <IRanges> <Rle> | <integer> <character> <integer>
[1] 8 [62813283, 62813415] + | 109838 ENSRNOE00000080929 1
[2] 8 [62814783, 62814889] + | 109839 ENSRNOE00000080039 2
[3] 8 [62816503, 62816652] + | 109840 ENSRNOE00000212529 3
[4] 8 [62816885, 62817088] + | 109841 ENSRNOE00000080173 4
[5] 8 [62818790, 62818982] + | 109842 ENSRNOE00000352148 5
[6] 8 [62819375, 62819590] + | 109843 ENSRNOE00000080393 6
[7] 8 [62820168, 62820361] + | 109844 ENSRNOE00000212498 7
---
seqlengths:
1 2 3 4 ... JH620517.1 JH620530.1 JH620566.1 JH620632.1
NA NA NA NA ... NA NA NA NA
The names of the transcripts are not readily available. To get the names,
bioMartTranscript_IDs <- names(bioMartExonsByTranscript) bioMartTranscript_IDs[1:10] [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
head(select(bioMartEnsemblRn5, keys=bioMartTranscript_IDs, cols="TXNAME",keytype="TXID")) TXID TXNAME 1 1 ENSRNOT00000044669 2 2 ENSRNOT00000046586 3 3 ENSRNOT00000073861 4 4 ENSRNOT00000073698 5 5 ENSRNOT00000044187 6 6 ENSRNOT00000072186
When you are getting genomic ranges or genomic lists from different databases, it is possible that the chromosomes will be named differently. In this case, we need to rename the chromosomes to do the comparison.
As an example, when we get the genomic features database from biomart the chromosome names may be different than when we get them from ucsc.
Define a vector to rename the chromosomes:
chromRename
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
"chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17"
18 19 20 X
"chr18" "chr19" "chr20" "chrX"
Then rename the chromosomes:
bioMartTranscripts <- transcripts(bioMartEnsemblRn5)
bioMartTranscripts_rename <- renameSeqlevels(bioMartTranscripts,chromRename)
bioMartTranscripts[10000:10000]
GRanges with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] 15 [62345358, 62348598] + | 23376 ENSRNOT00000001360
---
seqlengths:
1 2 3 4 ... JH620517.1 JH620530.1 JH620566.1 JH620632.1
NA NA NA NA ... NA NA NA NA
bioMartTranscripts_rename[10000:10000]
GRanges with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr15 [62345358, 62348598] + | 23376 ENSRNOT00000001360
---
seqlengths:
chr1 chr2 chr3 chr4 ... JH620517.1 JH620530.1 JH620566.1 JH620632.1
NA NA NA NA ... NA NA NA NA
You can also restrict the chromosomes to look at:
newSeqLevelsToKeep [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" [17] "chr17" "chr18" "chr19" "chr20" "chrX" oldSeqLevelsToKeep [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "X"
bioMartTranscripts_subset now only contains transcripts from 1 to 20 and X.
bioMartTranscripts_subset <- keepSeqlevels(bioMartTranscripts,oldSeqLevelsToKeep)
Define GRange List object:
bioMartExonsByTranscript <- exonsBy(bioMartEnsemblRn5,by="tx")
Subset the GRange List:
bioMartExonsByTranscript_subset <- keepSeqlevels(bioMartExonsByTranscript,oldSeqLevelsToKeep)
bioMartExonsByTranscript_rename <- renameSeqlevels(bioMartExonsByTranscript_subset,chromRename)