Skip to content

Genomicfeaturesandranges

walkerhound edited this page Jan 21, 2013 · 23 revisions

Home

This page will show what you need to do to create genomic features databases and genomic ranges.

One purpose is to find overlaps between the cufflink transcripts created by Laura Saba and pre-existing transcripts defined by ensembl etc.

Table of Contents

Required R modules

  1. biomaRt
  2. GenomicFeatures
  3. GenomicRanges
  4. rtracklayer

Installing R modules

I have found the easiest way is, within R, to issue statements like this:

source("http://bioconductor.org/biocLite.R")
biocLite(GenomicFeatures)

If I use the R package installer, I have found that the installation sometimes hangs.

Genomic Features SQLite databases

The genomic features module allows us to create sqlite transcript databases from Biomart, UCSC or user provided data. From these transcript databases, we can find overlapping transcripts easily.

Creating Genomic Features Transcript Databases from UCSC

To see available genomes at UCSC, you require rtracklayer.

ucscGenomes()[,"db"]
  [1] "hg19"    "hg18"    "hg17"    "hg16"    "vicPac1" "dasNov3" "otoGar3" "papHam1" "felCat5" "felCat4" "felCat3"
 [12] "panTro4" "panTro3" "panTro2" "panTro1" "bosTau7" "bosTau6" "bosTau4" "bosTau3" "bosTau2" "canFam3" "canFam2"
 [23] "canFam1" "turTru2" "loxAfr3" "nomLeu2" "nomLeu1" "gorGor3" "cavPor3" "eriEur1" "equCab2" "equCab1" "dipOrd1"
 [34] "triMan1" "calJac3" "calJac1" "pteVam1" "myoLuc2" "mm10"    "mm9"     "mm8"     "mm7"     "micMur1" "hetGla2"
 [45] "hetGla1" "monDom5" "monDom4" "monDom1" "ponAbe2" "ailMel1" "susScr3" "susScr2" "ochPri2" "ornAna1" "oryCun2"
 [56] "rn5"     "rn4"     "rn3"     "rheMac3" "rheMac2" "proCap1" "oviAri1" "sorAra1" "choHof1" "speTri2" "saiBol1"
 [67] "sorAra1" "sarHar1" "echTel1" "tupBel1" "macEug2" "gadMor1" "melUnd1" "galGal4" "galGal3" "galGal2" "latCha1"
 [78] "fr3"     "fr2"     "fr1"     "petMar2" "petMar1" "anoCar2" "anoCar1" "oryLat2" "geoFor1" "oreNil2" "chrPic1"
 [89] "gasAcu1" "tetNig2" "tetNig1" "melGal1" "xenTro3" "xenTro2" "xenTro1" "taeGut1" "danRer7" "danRer6" "danRer5"
[100] "danRer4" "danRer3" "ci2"     "ci1"     "braFlo1" "strPur2" "strPur1" "apiMel2" "apiMel1" "anoGam1" "droAna2"
[111] "droAna1" "droEre1" "droGri1" "dm3"     "dm2"     "dm1"     "droMoj2" "droMoj1" "droPer1" "dp3"     "dp2"    
[122] "droSec1" "droSim1" "droVir2" "droVir1" "droYak2" "droYak1" "caePb2"  "caePb1"  "cb3"     "cb1"     "ce10"   
[133] "ce6"     "ce4"     "ce2"     "caeJap1" "caeRem3" "caeRem2" "priPac1" "aplCal1" "sacCer3" "sacCer2" "sacCer1"

To see all the possible tables at UCSC, do this:

> supportedUCSCtables()
                                      track           subtrack
knownGene                        UCSC Genes               <NA>
knownGeneOld3                Old UCSC Genes               <NA>
wgEncodeGencodeManualV3       Gencode Genes    Genecode Manual
wgEncodeGencodeAutoV3         Gencode Genes      Genecode Auto
wgEncodeGencodePolyaV3        Gencode Genes     Genecode PolyA
ccdsGene                               CCDS               <NA>
refGene                        RefSeq Genes               <NA>
xenoRefGene                    Other RefSeq               <NA>
vegaGene                         Vega Genes Vega Protein Genes
vegaPseudoGene                   Vega Genes   Vega Pseudogenes
ensGene                       Ensembl Genes               <NA>
acembly                       AceView Genes               <NA>
sibGene                           SIB Genes               <NA>
nscanPasaGene                        N-SCAN    N-SCAN PASA-EST
nscanGene                            N-SCAN             N-SCAN
sgdGene                           SGD Genes               <NA>
sgpGene                           SGP Genes               <NA>
geneid                         Geneid Genes               <NA>
genscan                       Genscan Genes               <NA>
exoniphy                           Exoniphy               <NA>
augustusHints                      Augustus     Augustus Hints
augustusXRA                        Augustus   Augustus De Novo
augustusAbinitio                   Augustus Augustus Ab Initio
acescan                             ACEScan               <NA>
lincRNAsTranscripts     lincRNAsTranscripts               <NA>

Not all tables are available for all genomes.

rn5RefGeneUCSC <- makeTranscriptDbFromUCSC(genome="rn5",tablename="refGene")
Download the refGene table ... OK
Download the refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... metadata: OK
Make the TranscriptDb object ... OK
There were 50 or more warnings (use warnings() to see the first 50)

The errors seem to have to do with CDS:

> warnings()
Warning messages:
1: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]],  ... :
  UCSC data anomaly in transcript NM_031007: the cds cumulative length is not a multiple of 3
2: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]],  ... :
  UCSC data anomaly in transcript NM_032071: the cds cumulative length is not a multiple of 3

Here you can see information about chromosomes:

seqinfo(rn5RefGeneUCSC)
Seqinfo of length 2739
seqnames       seqlengths isCircular genome
chr1            290094216      FALSE    rn5
chr2            285068071      FALSE    rn5
chr3            183740530      FALSE    rn5
chr4            248343840      FALSE    rn5
chr5            177180328      FALSE    rn5
chr6            156897508      FALSE    rn5
chr7            143501887      FALSE    rn5
chr8            132457389      FALSE    rn5
chr9            121549591      FALSE    rn5
...                   ...        ...    ...
chrUn_JH620690       1647      FALSE    rn5
chrUn_JH620691       1172      FALSE    rn5
chrUn_JH620692       2686      FALSE    rn5
chrUn_JH620693       9171      FALSE    rn5
chrUn_JH620694       6347      FALSE    rn5
chrUn_JH620695       1669      FALSE    rn5
chrUn_JH620696       7236      FALSE    rn5
chrUn_JH620697       3488      FALSE    rn5
chrUn_JH620698       3129      FALSE    rn5

Creating Genomic Features Transcript Databases from Biomart

To get the latest version of ensemble, you need to point the host to www.ensembl.org or uswest.ensembl.org.

You can see which marts are available by using listMarts from the biomaRt package.

listMarts(host="uswest.ensembl.org")
               biomart               version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 70
2     ENSEMBL_MART_SNP  Ensembl Variation 70
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 70
4    ENSEMBL_MART_VEGA               Vega 50
5             REACTOME    REACTOME (CSHL US)
6                pride        PRIDE (EBI UK)

Then you can list the available datasets:

mart<-useMart("ENSEMBL_MART_ENSEMBL",host="uswest.ensembl.org")
listDatasets(mart=mart)

Among the results you will see:

12       rnorvegicus_gene_ensembl          Rattus norvegicus genes (Rnor_5.0)        Rnor_5.0

Finally, create the transcript database:

bioMartEnsemblRn5 <- makeTranscriptDbFromBiomart(
                         biomart="ENSEMBL_MART_ENSEMBL",
                         dataset='rnorvegicus_gene_ensembl',
                         host="uswest.ensembl.org")

You can now see many properties of the transcript database:

> bioMartEnsemblRn5
TranscriptDb object:
| Db type: TranscriptDb
| Supporting package: GenomicFeatures
| Data source: BioMart
| Genus and Species: Rattus norvegicus
| Resource URL: uswest.ensembl.org:80
| BioMart database: ENSEMBL_MART_ENSEMBL
| BioMart database version: Ensembl Genes 70
| BioMart dataset: rnorvegicus_gene_ensembl
| BioMart dataset description: Rattus norvegicus genes (Rnor_5.0)
| BioMart dataset version: Rnor_5.0
| Full dataset: yes
| miRBase build ID: NA
| transcript_nrow: 29189
| exon_nrow: 216691
| cds_nrow: 208375
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2013-01-21 10:34:48 -0700 (Mon, 21 Jan 2013)
| GenomicFeatures version at creation time: 1.10.1
| RSQLite version at creation time: 0.11.2
| DBSCHEMAVERSION: 1.0

Here is information about the chromosomes:

> seqinfo(bioMartEnsemblRn5)
Seqinfo of length 95
seqnames   seqlengths isCircular genome
1                <NA>       <NA>   <NA>
2                <NA>       <NA>   <NA>
3                <NA>       <NA>   <NA>
4                <NA>       <NA>   <NA>
5                <NA>       <NA>   <NA>
6                <NA>       <NA>   <NA>
7                <NA>       <NA>   <NA>
8                <NA>       <NA>   <NA>
9                <NA>       <NA>   <NA>
...               ...        ...    ...
JH620497.1       <NA>       <NA>   <NA>
JH620498.1       <NA>       <NA>   <NA>
JH620499.1       <NA>       <NA>   <NA>
JH620501.1       <NA>       <NA>   <NA>
JH620511.1       <NA>       <NA>   <NA>
JH620517.1       <NA>       <NA>   <NA>
JH620530.1       <NA>       <NA>   <NA>
JH620566.1       <NA>       <NA>   <NA>
JH620632.1       <NA>       <NA>   <NA>

Creating Genomic Features Transcript Databases from User Provided Data

Saving Genomic Features Transcript Databases

Define the sqlite file name:

bioSqliteFile <- '/Users/walkerhound/Documents/TranscriptDatabases/bioMartEnsemblRn5.sqlite'
rn5RefGeneUCSCFile <- '/Users/walkerhound/Documents/TranscriptDatabases/rn5RefGeneUCSC.sqlite'

Now save the database. Use saveDb from AnnotationDb instead of saveFeatures from GenomicFeatures because saveFeatures has been deprecated.

saveDb(bioMartEnsemblRn5,bioSqliteFile)
saveDb(rn5RefGeneUCSC,rn5RefGeneUCSCFile)

Genomic Ranges

Creating Genomic Ranges and Genomic Range Lists from Transcript Databases

This creates a genomic Range:

bioMartTranscripts <- transcripts(bioMartEnsemblRn5)

Here is an entry from the preceding genomic range:

bioMartTranscripts[10000:10000]
GRanges with 1 range and 2 metadata columns:
      seqnames               ranges strand |     tx_id            tx_name
         <Rle>            <IRanges>  <Rle> | <integer>        <character>
  [1]       15 [62345358, 62348598]      + |     23376 ENSRNOT00000001360
  ---
  seqlengths:
                1              2              3              4 ...     JH620517.1     JH620530.1     JH620566.1     JH620632.1
               NA             NA             NA             NA ...             NA             NA             NA             NA

This creates a genomic Range List:

bioMartExonsByTranscript <- exonsBy(bioMartEnsemblRn5,by="tx")

Here is an entry from the preceding genomic range list:

bioMartExonsByTranscript[15000:15000]
GRangesList of length 1:
$15000 
GRanges with 7 ranges and 3 metadata columns:
      seqnames               ranges strand |   exon_id          exon_name exon_rank
         <Rle>            <IRanges>  <Rle> | <integer>        <character> <integer>
  [1]        8 [62813283, 62813415]      + |    109838 ENSRNOE00000080929         1
  [2]        8 [62814783, 62814889]      + |    109839 ENSRNOE00000080039         2
  [3]        8 [62816503, 62816652]      + |    109840 ENSRNOE00000212529         3
  [4]        8 [62816885, 62817088]      + |    109841 ENSRNOE00000080173         4
  [5]        8 [62818790, 62818982]      + |    109842 ENSRNOE00000352148         5
  [6]        8 [62819375, 62819590]      + |    109843 ENSRNOE00000080393         6
  [7]        8 [62820168, 62820361]      + |    109844 ENSRNOE00000212498         7

---
seqlengths:
              1              2              3              4 ...     JH620517.1     JH620530.1     JH620566.1     JH620632.1
             NA             NA             NA             NA ...             NA             NA             NA             NA

The names of the transcripts are not readily available. To get the names,

bioMartTranscript_IDs <- names(bioMartExonsByTranscript)

bioMartTranscript_IDs[1:10]
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

head(select(bioMartEnsemblRn5, keys=bioMartTranscript_IDs, cols="TXNAME",keytype="TXID"))
  TXID             TXNAME
1    1 ENSRNOT00000044669
2    2 ENSRNOT00000046586
3    3 ENSRNOT00000073861
4    4 ENSRNOT00000073698
5    5 ENSRNOT00000044187
6    6 ENSRNOT00000072186

Comparing Genomic Ranges with Different Chromosome Names

When you are getting genomic ranges or genomic lists from different databases, it is possible that the chromosomes will be named differently. In this case, we need to rename the chromosomes to do the comparison.

As an example, when we get the genomic features database from biomart the chromosome names may be different than when we get them from ucsc.

Define a vector to rename the chromosomes:

chromRename
      1       2       3       4       5       6       7       8       9      10      11      12      13      14      15      16      17 
 "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" 
     18      19      20       X 
"chr18" "chr19" "chr20"  "chrX" 

Then rename the chromosomes:

bioMartTranscripts <- transcripts(bioMartEnsemblRn5)
bioMartTranscripts_rename <- renameSeqlevels(bioMartTranscripts,chromRename)
bioMartTranscripts[10000:10000]
GRanges with 1 range and 2 metadata columns:
      seqnames               ranges strand |     tx_id            tx_name
         <Rle>            <IRanges>  <Rle> | <integer>        <character>
  [1]       15 [62345358, 62348598]      + |     23376 ENSRNOT00000001360
  ---
  seqlengths:
                1              2              3              4 ...     JH620517.1     JH620530.1     JH620566.1     JH620632.1
               NA             NA             NA             NA ...             NA             NA             NA             NA
bioMartTranscripts_rename[10000:10000]
GRanges with 1 range and 2 metadata columns:
      seqnames               ranges strand |     tx_id            tx_name
         <Rle>            <IRanges>  <Rle> | <integer>        <character>
  [1]    chr15 [62345358, 62348598]      + |     23376 ENSRNOT00000001360
  ---
  seqlengths:
             chr1           chr2           chr3           chr4 ...     JH620517.1     JH620530.1     JH620566.1     JH620632.1
               NA             NA             NA             NA ...             NA             NA             NA             NA

You can also restrict the chromosomes to look at:

newSeqLevelsToKeep
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
[17] "chr17" "chr18" "chr19" "chr20" "chrX" 
oldSeqLevelsToKeep
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "X" 

bioMartTranscripts_subset now only contains transcripts from 1 to 20 and X.

bioMartTranscripts_subset <- keepSeqlevels(bioMartTranscripts,oldSeqLevelsToKeep)

Define GRange List object:

bioMartExonsByTranscript <- exonsBy(bioMartEnsemblRn5,by="tx")

Subset the GRange List:

bioMartExonsByTranscript_subset <- keepSeqlevels(bioMartExonsByTranscript,oldSeqLevelsToKeep)
bioMartExonsByTranscript_rename <- renameSeqlevels(bioMartExonsByTranscript_subset,chromRename)

Useful Links

  1. GenomicRanges R Script
  2. Allen's Interval Algebra wikipedia page
  3. iRanges
  4. PDF about IRanges and GenomicRanges

Clone this wiki locally