Exon array data formats

walkerhound edited this page Nov 30, 2012 · 9 revisions

Exon Arrays

Affymetrix Files

ps files

There are 4 ps files: core, extended, full and comprehensive. Each file contains a list of the probesets that have the given evidence level. For example, the file MoEx-1_0-st-v1.r2.dt1.mm9.core.ps contains the list of core probesets.

Here is some information about core, extended and full from the paper "Exon Probeset Annotations and Transcript Cluster Groupings", which is available on the Affymetrix website:

For the purposes of establishing a hierarchy of gene confidence levels, we
partitioned the sources of input transcript annotations into three types. From
highest to lowest confidence, the types were labeled core, extended, and full.
Broadly defined, the core type consisted of (BLAT) alignments of mRNA with
annotated full-length CDS regions, the extended type consisted of cDNA
alignments and annotations based on cDNA alignments, and the full type
consisted of sets of ab-initio gene predictions.
The core type was so named because the annotations in this type were intended
to be the foundation from which we built our gene annotations. The extended
type derived its name from the sense that these annotations would extend the
boundaries of the core genes. The idea behind the name of the full type was that
it would signify all possible content.

mps files

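There are 4 mps files: core, extended, full and comprehensive. Each file is tab-separated with one row per transcript cluster; probeset_list is a space-separated list of the probeset IDs in that cluster, and probe_count is the number of probes they contain. The format of these files is as follows: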

probeset_id	transcript_cluster_id	probeset_list	probe_count
6848511	6848511	5200867 5073655 5119214 5360979 	16
6864895	6864895	4607687 5430786 4871603 5483756 4904071 4796434 5213657 5362920 5037347 4648676 	39
6766590	6766590	5116601 	4
6914045	6914045	4688205 4381475 5602010 4406850 5589714 4726311 5541306 4592063 	30
6963197	6963197	5481702 5453211 5364789 5440769 4419009 5004542 4458132 5108216 5447337 	36
6766588	6766588	5275902 	4
6995964	6995964	5020878 4571057 5120475 4489970 5541254 5265690 4600556 4723033 4957862 5120987 5557104 4885841 4905896 4557311 4857590 4896406 5097238 4836308 	69
6766587	6766587	4382262 	4
6815739	6815739	4812808 5107869 5439601 4867412 4449891 4545773 5457009 	26
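The layout above can be read with a few lines of Python. This is a minimal sketch, assuming the columns are tab-separated, that probeset_list is space-separated, and that any line starting with '#' is a header comment to skip. The function name parse_mps is hypothetical.

```python
import csv
import io


def parse_mps(handle):
    """Yield one dict per row of an mps file.

    Assumes tab-separated columns with a header row, optional '#'
    comment lines, and a space-separated probeset_list column.
    """
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        yield {
            "probeset_id": row["probeset_id"],
            "transcript_cluster_id": row["transcript_cluster_id"],
            "probesets": row["probeset_list"].split(),
            "probe_count": int(row["probe_count"]),
        }


# Two rows copied from the listing above.
sample = (
    "probeset_id\ttranscript_cluster_id\tprobeset_list\tprobe_count\n"
    "6848511\t6848511\t5200867 5073655 5119214 5360979 \t16\n"
    "6766590\t6766590\t5116601 \t4\n"
)
records = list(parse_mps(io.StringIO(sample)))
for rec in records:
    print(rec["transcript_cluster_id"], len(rec["probesets"]), rec["probe_count"])
```

Splitting probeset_list with split() rather than split(" ") discards the trailing space that appears at the end of each list in the sample rows.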

Probe Group File (PGF) Format

This information came from a readme file:

PGF (Probe Group File Format): This file is a tab separated text file which has
information about how probes are arranged into probe sets. Lines starting with
a '#%' indicate a command tag. These lines must appear at the top of the file.
They are followed by a 'key=value' syntax and are used to embed header
information into the PGF file. Lines that start with a '##' are comments. These
lines are ignored by the PGF parser. Comments can be used by the user to embed
textual information or to comment out data that they don't want to be used.
What makes a PGF file different from a more typical tab separated text file is
that there is a nested structure to it. Specifically, the leading whitespace
for the data lines is significant as it indicates parent/child relationships.
For instance a line describing a probe set has no leading white space. A line
describing an atom (ie probe pair for expression) has a single tab leading
whitespace and a line describing a probe has two leading tabs. The atoms for a
given probe set are listed following the probe set line and the probes for an
atom are listed following the atom line. Special command tags are used to
describe the content in the data lines. The '#%header0=' command tag lists the
column names for the probe set lines. The '#%header1=' command tag lists the
column names for the atom lines. The '#%header2=' command tag lists the column
names for the probe lines. The specific columns in a pgf file can vary. Unlike
CDF files, the probe set, atom, and probe are all associated with a unique ID;
also, there is no x/y coordinate information for where the probes are on the
chip. (That information is in the CLF file.) Here are the columns and their
descriptions for the contents of the MoEx-1_0-st-v1 PGF file:

   - probeset_id: the probe set identifier. unique over the chip, but not
                  necessarily unique over all chip designs.
   - type: the type(s) or class(s) that a probeset, atom, and/or probe
           belongs to. See below for more details.
   - atom_id: the atom identifier. unique over the chip, but
              not necessarily unique over all chip designs.
   - probe_id: the probe identifier. unique over the chip, but
               not necessarily unique over all chip designs.
   - gc_count: the number of G/C bases in the probe
   - probe_length: the length of the probe
   - interrogation_position: the position of the mismatch base (even though
		most of the probesets on this chip design lack a mismatch
		probe, the center base, 13th, is still listed as the
		interrogation position)
   - probe_sequence: the sequence of the probe
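The nesting rule described above (no leading tab for a probe set, one for an atom, two for a probe) can be sketched in Python as follows. This is an illustration only, not the Affx::WTA::ParsePGF interface. The function name parse_pgf, the assignment of columns to the three header levels, and all sample rows are assumptions made for the example.

```python
import io


def parse_pgf(handle):
    """Parse PGF text into (headers, probesets).

    Nesting is read from the number of leading tabs: 0 = probe set,
    1 = atom, 2 = probe. Column names come from the #%header0..2 tags.
    """
    headers = {}    # every '#%key=value' tag; values kept in a list per key
    columns = {}    # nesting level -> list of column names
    probesets = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith("#%"):
            key, _, value = line[2:].partition("=")
            if key in ("header0", "header1", "header2"):
                columns[int(key[-1])] = value.split("\t")
            headers.setdefault(key, []).append(value)
            continue
        if line.startswith("#"):
            continue  # '##' comment line, ignored
        level = len(line) - len(line.lstrip("\t"))
        record = dict(zip(columns[level], line[level:].split("\t")))
        if level == 0:
            record["atoms"] = []
            probesets.append(record)
        elif level == 1:
            record["probes"] = []
            probesets[-1]["atoms"].append(record)
        else:
            probesets[-1]["atoms"][-1]["probes"].append(record)
    return headers, probesets


# Invented sample: the IDs and type strings are for illustration only.
sample = (
    "#%chip_type=MoEx-1_0-st-v1\n"
    "#%header0=probeset_id\ttype\n"
    "#%header1=atom_id\n"
    "#%header2=probe_id\ttype\tgc_count\tprobe_length\t"
    "interrogation_position\tprobe_sequence\n"
    "## a comment line\n"
    "2315101\tmain\n"
    "\t1\n"
    "\t\t494998\tpm:st\t14\t25\t13\tCACGGGAAGTCTGGGCTAAGAGACA\n"
)
headers, probesets = parse_pgf(io.StringIO(sample))
probe = probesets[0]["atoms"][0]["probes"][0]
print(probesets[0]["probeset_id"], probe["probe_id"], probe["probe_sequence"])
```

Tag values are stored as lists because some tags, such as chip_type, may appear on more than one line.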

Some of the header tags you may encounter in the PGF file:

   - #%pgf_version: PGF file format version 
   - #%chip_type: The chip type expected in the CEL files. Multiple
		chip_type lines may be listed when there are multiple
		GCOS library files for a given chip design
   - #%lib_set_name: The name of the library file set. Generally
		there will be one lib_set_name per chip design.
   - #%lib_set_version: The version of the content for the 
		lib_set_name.
   - #%create_date: The date that the PGF file was created
   - #%header0: Probeset header
   - #%header1: Atom header
   - #%header2: Probe header

When present, the clf_lib_set_name and clf_lib_set_version tags indicate
which CLF file is referenced by the PGF file. When absent, it is assumed
that the CLF file has the same lib_set_name and lib_set_version as the
PGF file.
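The fallback rule for locating the CLF file can be written as a small lookup. This is a sketch under assumptions: the tags are held in a plain dict, the name and version fall back independently, and the sample values are invented. The function name clf_reference is hypothetical.

```python
def clf_reference(pgf_tags):
    """Return (lib_set_name, lib_set_version) of the CLF a PGF refers to.

    pgf_tags maps '#%' tag names to their values. The clf_* tags are
    used when present; otherwise the PGF's own lib_set_name and
    lib_set_version are assumed to apply to the CLF as well.
    """
    name = pgf_tags.get("clf_lib_set_name", pgf_tags.get("lib_set_name"))
    version = pgf_tags.get("clf_lib_set_version", pgf_tags.get("lib_set_version"))
    return name, version


# Invented tag values, for illustration only.
without_clf_tags = {"lib_set_name": "example-set", "lib_set_version": "r2"}
with_clf_tags = dict(
    without_clf_tags,
    clf_lib_set_name="example-clf-set",
    clf_lib_set_version="r1",
)
print(clf_reference(without_clf_tags))
print(clf_reference(with_clf_tags))
```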

Probeset types as reported in the PGF file:

  What type of target does the probeset interrogate (a
  single probeset may be associated with more than one):
    - main: probeset is part of the main design
    - control->affx: probeset is a standard AFFX control
    - control->chip: probeset is a chip control
    - control->bgp->antigenomic: probeset contains background
	probes (antigenomic background probes)
    - control->bgp->genomic: probeset contains background
	probes (genomic background probes)
    - normgene->exon: probeset is from an exonic region
	of a normalization control gene
    - normgene->intron: probeset is from an intronic region
	of a normalization control gene
    - rescue->FLmRNA->unmapped: probeset consists of probes
	tiled across an mRNA transcript which either
	didn't align to the genome, or aligned poorly

Probe (not probeset) types as reported in the PGF file:

  What type of sample target is interrogated by the probeset:
    - at: antisense target probe
    - st: sense target probe

  Perfect match or mismatch probe:
    - pm: perfect match probe
    - mm: mismatch probe

  Internal Control Probe Types:
    - blank
    - generic
    - jumbo-checkerboard
    - thermo 
    - trigrid 

The Affx::WTA::ParsePGF perl module should be used to interface with the PGF
files. This module is in the process of being rewritten in C++ and will 
probably be added to the File Parser SDK sometime in the future.

Sequence files

README for Affymetrix Mouse Exon Array Sequence files.

Copyright 2005-2008, Affymetrix Inc.
All Rights Reserved

The content of Affymetrix array sequence files is covered by the
terms of use or license located at http://www.affymetrix.com/site/terms.affx

Array name:      Mouse Exon 1.0 ST
Array/chip type: MoEx-1_0-st-v1
Part Numbers:    900831, 900819
Organism:        Mus musculus

This README provides a guide to the contents of the array sequence
files for the Affymetrix Mouse Exon Array. 

Mouse Exon Array support materials web site:
http://www.affymetrix.com/support/technical/byproduct.affx?product=moexon-st


Contents
--------

I. Probe Sequence Files
   A. Probe fasta file
      1. Description line attributes
      2. Example entry
   B. Probe tabular file
      1. Column header line
      2. Example entry
   C. Array design categories

II. Probe Set Sequence File
   A. Probe set fasta file
      1. Description line attributes
      2. Example entry

III. Transcript Cluster Sequence File
   A. Transcript cluster fasta file
      1. Description line attributes
      2. Example entry

I. Probe Sequence Files
-----------------------

All probe sequences, in both fasta and tabular format, are provided in
the orientation they exist on the array and in the 5'->3'
direction. For a sense target (st) array such as the gene and exon
arrays, this corresponds to the reverse complement of the orientation
of the target mRNA sequence.

The probe sequence files include control probes in addition to
all probes from the main design.
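To recover the target (mRNA) orientation from a probe sequence on a sense target array, take the reverse complement. The short sketch below uses the probe from the example entry in section I.A.2 and compares the result with the first 25 bases of the probe set sequence in section II.A.2. The function name reverse_complement is our own.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T, any case)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


# Sequences copied from the example entries in this README.
probe = "CACGGGAAGTCTGGGCTAAGAGACA"            # probe 494998, as on the array
probe_set_start = "TGTCTCTTAGCCCAGACTTCCCGTG"  # first 25 bases of probe set 2315101

target_orientation = reverse_complement(probe)
print(target_orientation)
print(target_orientation == probe_set_start)
```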

I.A. Probe fasta file

   The probe fasta file contains all probe sequences from the array in
   fasta format. The identifier is composed of 'probe' followed by the
   array type followed by the probe id followed by the x and y
   position of the probe on the array, with each of these items
   separated by a colon ':' character. The identifier is terminated by a
   semicolon ';'.

   I.A.1. Description line attributes

   Additional attributes for each probe are included in the
   description line in tag=value pairs. The following tags are
   provided: 

      Attribute                Description
    -------------          ------------------------------------------
      TranscriptClusterID   Transcript cluster identifier (integer)
      Assembly              Genome assembly version from array design time
      Seqname               Sequence name for genomic location of probe 
      Start                 Starting coordinate of probe genomic location (1-based)
      Stop                  Ending coordinate of probe genomic location (1-based)
      Strand                Sequence strand of probe genomic location (+ or -)
      Sense/Antisense       Strandedness of the target which the probe detects
      category              Array design category of the probe (described below)

     * Note: The sense/antisense field is the only one that is not
       tag=value, but is either the string 'Sense' or 'Antisense'. 

   I.A.2. Example entry

   Shown is an example fasta formatted probe sequence entry from the
   human exon array. 

>probe:HuEx-1_0-st-v2:494998;917:193; ProbeSetID=2315101; Assembly=build-34/hg16; Seqname=chr1; Start=1788; Stop=1812; Strand=+; Sense; category=main
CACGGGAAGTCTGGGCTAAGAGACA
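A description line like the one above can be split into its parts with a regular expression. This is a minimal sketch: the function name parse_probe_header and the output key names are assumptions. Note that in the example entry the probe id is followed by a semicolon rather than a colon before the x:y position, so the pattern accepts either character there.

```python
import re

# Identifier part: probe:<array type>:<probe id>[:;]<x>:<y>;
HEADER_RE = re.compile(
    r"^>probe:(?P<array>[^:;]+):(?P<probe_id>\d+)[:;]"
    r"(?P<x>\d+):(?P<y>\d+);\s*(?P<rest>.*)$"
)


def parse_probe_header(line):
    """Parse a probe fasta description line into a dict."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a probe fasta header: {line!r}")
    info = {
        "array": match.group("array"),
        "probe_id": int(match.group("probe_id")),
        "x": int(match.group("x")),
        "y": int(match.group("y")),
    }
    for token in match.group("rest").split(";"):
        token = token.strip()
        if not token:
            continue
        if "=" in token:
            key, value = token.split("=", 1)
            info[key] = value
        elif token in ("Sense", "Antisense"):
            # The only attribute that is not written as tag=value.
            info["target_strandedness"] = token
    return info


header = (
    ">probe:HuEx-1_0-st-v2:494998;917:193; ProbeSetID=2315101; "
    "Assembly=build-34/hg16; Seqname=chr1; Start=1788; Stop=1812; "
    "Strand=+; Sense; category=main"
)
info = parse_probe_header(header)
print(info["probe_id"], info["x"], info["y"], info["target_strandedness"])
```

Attribute values are left as strings; convert Start and Stop with int() where arithmetic is needed.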


I.B. Probe tabular file

   The probe tabular data file contains all probe sequences from the
   array in tab-delimited format. Column headers are indicated in the
   first line.

   I.B.1. Column header line

      Column Name                        Description
    -----------------       ------------------------------------------
      Probe ID               Probe identifier (integer)
      Probe Set ID           Probe set identifier (integer)
      probe x                X coordinate for probe location on array
      probe y                Y coordinate for probe location on array
      assembly               Genome assembly version from array design time
      seqname                Sequence name for genomic location of probe 
      start                  Starting coordinate of probe genomic location (1-based)
      stop                   Ending coordinate of probe genomic location (1-based)
      strand                 Sequence strand of probe genomic location (+ or -)
      probe sequence         Probe sequence
      target strandedness    Strandedness of the target which the probe detects
      category               Array design category of the probe (described below)


   I.B.2. Example entry

   Shown is an example column header line and data line from the human
   exon array. 

Probe ID	Probe Set ID	probe x	probe y	assembly	seqname	start	stop	strand	probe sequence	target strandedness	category
494998	2315101	917	193	build-34/hg16	chr1	1788	1812	+	CACGGGAAGTCTGGGCTAAGAGACA	Sense	main
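The tabular file can be loaded with Python's csv module, keyed by the column names in the header line. This is a sketch only: the function name read_probe_table and the output key names are our own, and it assumes every row carries integer coordinates, which may not hold for control probes.

```python
import csv
import io


def read_probe_table(handle):
    """Yield one dict per probe from the tab-delimited probe file."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield {
            "probe_id": int(row["Probe ID"]),
            "probe_set_id": int(row["Probe Set ID"]),
            "x": int(row["probe x"]),
            "y": int(row["probe y"]),
            "assembly": row["assembly"],
            "seqname": row["seqname"],
            "start": int(row["start"]),
            "stop": int(row["stop"]),
            "strand": row["strand"],
            "sequence": row["probe sequence"],
            "target_strandedness": row["target strandedness"],
            "category": row["category"],
        }


# Header and data line copied from the example entry above.
sample = (
    "Probe ID\tProbe Set ID\tprobe x\tprobe y\tassembly\tseqname\tstart\tstop\t"
    "strand\tprobe sequence\ttarget strandedness\tcategory\n"
    "494998\t2315101\t917\t193\tbuild-34/hg16\tchr1\t1788\t1812\t+\t"
    "CACGGGAAGTCTGGGCTAAGAGACA\tSense\tmain\n"
)
probes = list(read_probe_table(io.StringIO(sample)))
first = probes[0]
# For this entry, the 1-based start/stop span equals the probe length.
print(first["probe_id"], first["stop"] - first["start"] + 1, len(first["sequence"]))
```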


I.C. Array design categories

    Both the probe fasta and tab files contain an indication of the
    array design category of each probe. Here is a description of the
    different types of categories.

              Category                   Description
       ------------------------    --------------------------------------
        main                       part of the main design 
        control->affx              a standard AFFX control
        control->chip              a chip control
        control->bgp->antigenomic  antigenomic background probes
        control->bgp->genomic      genomic background probes 
        normgene->exon             from an exonic region of a
                                   normalization control gene   
        normgene->intron           from an intronic region of a
                                   normalization control gene 
        rescue->FLmRNA->unmapped   probes were tiled across an mRNA
                                   transcript which either did not align
                                   to the genome, or aligned poorly  
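The category strings in the table above form a small hierarchy with '->' as the level separator. The sketch below splits a category into its levels and tallies the listed categories by top level. The helper names category_path and is_control are hypothetical.

```python
from collections import Counter


def category_path(category):
    """Split 'control->bgp->genomic' into ('control', 'bgp', 'genomic')."""
    return tuple(category.split("->"))


def is_control(category):
    """True for any category under the top-level 'control' group."""
    return category_path(category)[0] == "control"


# The categories listed in the table above.
categories = [
    "main",
    "control->affx",
    "control->chip",
    "control->bgp->antigenomic",
    "control->bgp->genomic",
    "normgene->exon",
    "normgene->intron",
    "rescue->FLmRNA->unmapped",
]
top_level = Counter(category_path(c)[0] for c in categories)
print(dict(top_level))
print([c for c in categories if is_control(c)])
```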


II. Probe Set Sequence File
-----------------------------

Probe set sequences consist of the contiguous genomic sequence
starting at the beginning of the first probe and ending at the end of
the last probe in the set as they are aligned to the genome. They are
provided in the orientation they exist in the mRNA in 5'->3'
direction.

Probe set sequences are extracted from the version of the
genome that was used for array design. During NetAffx annotation, the
array design may be lifted to a more current version of the genome
assembly. There could be differences between the design-time and
annotation-time versions of the probe set sequences. If you
require sequence data based on a version of the genome different from
the design-time assembly, contact Affymetrix support. The file
containing the probe set sequences is tagged with the name of
the genome assembly on which it is based.

The probe set sequence file includes the exon and intron normalization
control probe sets in addition to entries from the main design.
Categories of probe sets are the same as for probes, described above
in section I.C.

II.A. Probe set fasta file

   The probe set fasta sequence file contains probe set sequences from
   the array in fasta format. The identifier is composed of 'probe_set'
   followed by the array type followed by the probe set ID, with each
   of these items separated by a colon ':' character. The identifier is
   terminated by a semicolon ';'.

   II.A.1. Description line attributes

   Additional attributes for each probe set are included in the
   description line in tag=value pairs. The following tags are
   provided:

      Attribute                Description
    -------------    ------------------------------------------
      Assembly        Genome assembly version from array annotation time
      Seqname         Sequence name for genomic location of probe set
      Start           Starting coordinate of probe set genomic location (1-based)
      Stop            Ending coordinate of probe set genomic location (1-based)
      Strand          Sequence strand of probe set genomic location (+ or -)
      Length          Length of the probe set
      category        Array design category of the probe set

   II.A.2. Example entry

   Shown is an example fasta formatted probe set sequence entry from
   the human exon array. 

>probe_set:HuEx-1_0-st-v2:2315101; Assembly=build-36/hg18; Seqname=chr1; Start=1788; Stop=2030; Strand=+; Length=243; category=main
TGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTC
TTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAG
GGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGG
ATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAG
GCA
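As a consistency check on an entry like the one above, the Length attribute can be compared with the Start and Stop coordinates. The sketch assumes the coordinates are inclusive as well as 1-based, which is what the example entry implies (2030 - 1788 + 1 = 243). The function names parse_attributes and span_length are our own.

```python
def parse_attributes(header):
    """Split a fasta description line into (identifier, attributes)."""
    identifier, _, description = header.lstrip(">").partition(";")
    attrs = {}
    for token in description.split(";"):
        key, sep, value = token.strip().partition("=")
        if sep:  # keep only tag=value pairs
            attrs[key] = value
    return identifier, attrs


def span_length(attrs):
    """Length implied by Start/Stop, assuming both ends are inclusive."""
    return int(attrs["Stop"]) - int(attrs["Start"]) + 1


header = (
    ">probe_set:HuEx-1_0-st-v2:2315101; Assembly=build-36/hg18; Seqname=chr1; "
    "Start=1788; Stop=2030; Strand=+; Length=243; category=main"
)
identifier, attrs = parse_attributes(header)
print(identifier, span_length(attrs), int(attrs["Length"]))
```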


III. Transcript Cluster Sequence File
-------------------------------------

Transcript cluster sequences are created by removing all intronic
regions from the transcript cluster sequence and splicing
together all constituent exons into a single sequence for each
transcript cluster. They are provided in the orientation they exist in
the mRNA in 5'->3' direction.

The length given for the transcript cluster sequence is the total
length of all spliced exon clusters within the transcript cluster,
removing any intronic sequence. This will differ from the length
obtained by taking the difference between the start and stop
genomic coordinates for the transcript cluster, which would include
any intronic sequence.
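The difference between the two lengths can be illustrated with a few lines of Python. The exon coordinates below are hypothetical, chosen only to show the arithmetic, and the sketch assumes 1-based inclusive coordinates and non-overlapping exons.

```python
def spliced_length(exons):
    """Total length of the exons once introns are removed."""
    return sum(stop - start + 1 for start, stop in exons)


def genomic_span(exons):
    """Length from the first exon start to the last exon stop, introns included."""
    first_start = min(start for start, _ in exons)
    last_stop = max(stop for _, stop in exons)
    return last_stop - first_start + 1


# Hypothetical exon clusters (start, stop) for one transcript cluster.
exons = [(100, 199), (300, 349), (500, 604)]
print(spliced_length(exons))  # exons only
print(genomic_span(exons))    # exons plus the introns between them
```

For a transcript cluster with a single exon the two values coincide, as in the example entry in section III.A.2, where Start=9271, Stop=9575 and Length=305.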

Transcript cluster sequences are extracted from the version of the
genome that was used for array design. During NetAffx annotation, the
array design may be lifted to a more current version of the genome
assembly. There could be differences between the design-time and
annotation-time versions of the transcript cluster sequences. If you
require sequence data based on a version of the genome different from
the design-time assembly, contact Affymetrix support. The file
containing the transcript cluster sequences is tagged with the name of
the genome assembly on which it is based.

The transcript cluster sequence file does not include any
controls, only entries from the main design. 

III.A. Transcript cluster fasta file

   The transcript cluster fasta sequence file contains transcript
   cluster sequences from the array in fasta format. The identifier is
   composed of 'transcript_cluster' followed by the array type
   followed by the transcript cluster ID, with each of these items
   separated by a colon ':' character. The identifier is terminated by a
   semicolon ';'.

   III.A.1. Description line attributes

   Additional attributes for each transcript cluster are included in
   the description line in tag=value pairs. The following tags are
   provided:

      Attribute                Description
    -------------    ------------------------------------------
      Assembly        Genome assembly version from array annotation time
      Seqname         Sequence name for genomic location of TC
      Start           Starting coordinate of TC genomic location (1-based)
      Stop            Ending coordinate of TC genomic location (1-based)
      Strand          Sequence strand of TC genomic location (+ or -)
      Length          Length of concatenated exon clusters in the TC

   III.A.2. Example entry 

   Shown is an example fasta formatted transcript cluster sequence
   entry from the human exon array. 

>transcript_cluster:HuEx-1_0-st-v2:2315109; Assembly=build-36/hg18; Seqname=chr1; Start=9271; Stop=9575; Strand=+; Length=305;
GGAAAGGGAGGGGGAGGATGTGGGATGGTGGAGGGGCTGCAGACTCTGGGCTAGGGAAAG
CTGGGATGTCTCTAAAGGTTGGAATGAATGGCCTAGAATCCGACCCAATAAGCCAAAGCC
ACTTCCACCAACGTTAGAAGGCCTTGGCCCCCAGAGAGCCAATTTCACAATCCAGAAGTC
CCCGTGCCCTAAAGGGTCTGCCCTGATTACTCCTGGCTCCTTGTGTGCAGGGGGCTCAGG
CATGGCAGGGCTGGGAGTACCAGCAGGCACTCAAGCGGCTTAAGTGTTCCATGACAGACT
GGTAT
