Exon array data formats
walkerhound edited this page Nov 30, 2012
There are 4 ps files: core, extended, full and comprehensive. Each file contains a list of the probesets that have the given evidence level. For example, the file MoEx-1_0-st-v1.r2.dt1.mm9.core.ps contains the list of core probesets.
Here is some information about core, extended and full from the paper Exon Probeset Annotations and Transcript Cluster Groupings, which is available on the Affymetrix website:
For the purposes of establishing a hierarchy of gene confidence levels, we partitioned the sources of input transcript annotations into three types. From highest to lowest confidence, the types were labeled core, extended, and full. Broadly defined, the core type consisted of (BLAT) alignments of mRNA with annotated full-length CDS regions, the extended type consisted of cDNA alignments and annotations based on cDNA alignments, and the full type consisted of sets of ab-initio gene predictions.
The core type was so named because the annotations in this type were intended to be the foundation from which we built our gene annotations. The extended type derived its name from the sense that these annotations would extend the boundaries of the core genes. The idea behind the name of the full type was that it would signify all possible content.
There are 4 mps files - core, extended, full and comprehensive. The format of these files is as follows:
probeset_id  transcript_cluster_id  probeset_list  probe_count
6848511  6848511  5200867 5073655 5119214 5360979  16
6864895  6864895  4607687 5430786 4871603 5483756 4904071 4796434 5213657 5362920 5037347 4648676  39
6766590  6766590  5116601  4
6914045  6914045  4688205 4381475 5602010 4406850 5589714 4726311 5541306 4592063  30
6963197  6963197  5481702 5453211 5364789 5440769 4419009 5004542 4458132 5108216 5447337  36
6766588  6766588  5275902  4
6995964  6995964  5020878 4571057 5120475 4489970 5541254 5265690 4600556 4723033 4957862 5120987 5557104 4885841 4905896 4557311 4857590 4896406 5097238 4836308  69
6766587  6766587  4382262  4
6815739  6815739  4812808 5107869 5439601 4867412 4449891 4545773 5457009  26
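As a rough illustration, the rows above could be read in Python along the following lines. This is only a sketch: it assumes the four columns are tab-separated, that the members of probeset_list are separated by single spaces, and that any lines starting with '#' are header or comment lines. The function name read_mps is made up for this example.

```python
import csv
import io


def read_mps(handle):
    """Yield one dict per row of an mps file (hypothetical helper)."""
    # Assumption: '#'-prefixed lines are header/comment lines and can be skipped.
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    for rec in csv.DictReader(rows, delimiter="\t"):
        yield {
            "probeset_id": rec["probeset_id"],
            "transcript_cluster_id": rec["transcript_cluster_id"],
            # Assumption: member probesets are separated by single spaces.
            "probeset_list": rec["probeset_list"].split(),
            "probe_count": int(rec["probe_count"]),
        }


# Two rows taken from the sample above, re-typed with tabs between columns.
sample = (
    "probeset_id\ttranscript_cluster_id\tprobeset_list\tprobe_count\n"
    "6848511\t6848511\t5200867 5073655 5119214 5360979\t16\n"
    "6766590\t6766590\t5116601\t4\n"
)
records = list(read_mps(io.StringIO(sample)))
print(records[0]["probeset_list"])
```

For a real file, pass an open file handle instead of the io.StringIO object.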
This information came from a readme file:
PGF (Probe Group File Format): This file is a tab separated text file which has
information about how probes are arranged into probe sets. Lines starting with
a '#%' indicate a command tag. These lines must appear at the top of the file.
They are followed by a 'key=value' syntax and are used to embed header
information into the PGF file. Lines that start with a '##' are comments. These
lines are ignored by the PGF parser. Comments can be used by the user to embed
textual information or to comment out data that they don't want to be used.
What makes a PGF file different from a more typical tab separated text file is
that there is a nested structure to it. Specifically, the leading whitespace
for the data lines is significant as it indicates parent/child relationships.
For instance a line describing a probe set has no leading white space. A line
describing an atom (ie probe pair for expression) has a single tab leading
whitespace and a line describing a probe has two leading tabs. The atoms for a
given probe set are listed following the probe set line and the probes for an
atom are listed following the atom line. Special command tags are used to
describe the content in the data lines. The '#%header0=' command tag lists the
column names for the probe set lines. The '#%header1=' command tag lists the
column names for the atom lines. The '#%header2=' command tag lists the column
names for the probe lines. The specific columns in a pgf file can vary. Unlike
CDF files, the probe set, atom, and probe are all associated with a unique ID;
also, there is no x/y coordinate information for where the probes are on the
chip. (That information is in the CLF file.) Here are the columns and their
descriptions for the contents of the MoEx-1_0-st-v1 PGF file:
- probeset_id: the probe set identifier. unique over the chip, but not
necessarily unique over all chip designs.
- type: the type(s) or class(s) that a probeset, atom, and/or probe
belongs to. See below for more details.
- atom_id: the atom identifier. unique over the chip, but
not necessarily unique over all chip designs.
- probe_id: the probe identifier. unique over the chip, but
not necessarily unique over all chip designs.
- gc_count: the number of G/C bases in the probe
- probe_length: the length of the probe
- interrogation_position: the position of the mismatch base (even though
most of the probesets on this chip design lack a mismatch
probe, the center base, 13th, is still listed as the
interrogation position)
- probe_sequence: the sequence of the probe
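The nesting described above (no leading tab for a probe set, one for an atom, two for a probe) can be sketched in Python as follows. This is not the Affymetrix parser, only an illustration. It assumes the column names in the header0/1/2 tags are tab-separated and may themselves carry leading tabs. The function name parse_pgf and the IDs in the sample are made up.

```python
def parse_pgf(lines):
    """Return a list of probe sets, each holding nested atoms and probes."""
    headers = {}      # nesting level (0, 1, 2) -> list of column names
    probesets = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("##"):
            continue  # blank line or comment
        if line.startswith("#%"):
            key, _, value = line[2:].partition("=")
            if key in ("header0", "header1", "header2"):
                # Assumption: column names are tab-separated in the tag value.
                headers[int(key[-1])] = value.lstrip("\t").split("\t")
            continue
        # The number of leading tabs gives the nesting level.
        level = len(line) - len(line.lstrip("\t"))
        record = dict(zip(headers[level], line.lstrip("\t").split("\t")))
        if level == 0:
            record["atoms"] = []
            probesets.append(record)
        elif level == 1:
            record["probes"] = []
            probesets[-1]["atoms"].append(record)
        else:
            probesets[-1]["atoms"][-1]["probes"].append(record)
    return probesets


# Made-up sample: the IDs are illustrative, not taken from a real PGF file.
sample = [
    "#%header0=probeset_id\ttype\n",
    "#%header1=\tatom_id\n",
    "#%header2=\t\tprobe_id\ttype\tgc_count\tprobe_length"
    "\tinterrogation_position\tprobe_sequence\n",
    "## comment lines are ignored\n",
    "1001\tmain\n",
    "\t1\n",
    "\t\t5001\tpm\t14\t25\t13\tCACGGGAAGTCTGGGCTAAGAGACA\n",
]
probesets = parse_pgf(sample)
print(probesets[0]["atoms"][0]["probes"][0]["probe_sequence"])
```

Keeping the records as nested dicts mirrors the parent/child layout of the file. All values are left as strings, since the specific columns in a PGF file can vary.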
Some of the header tags you may encounter in the PGF file:
- #%pgf_version: PGF file format version
- #%chip_type: The chip type expected in the CEL files. Multiple
chip_type lines may be listed when there are multiple
GCOS library files for a given chip design
- #%lib_set_name: The name of the library file set. Generally
there will be one lib_set_name per chip design.
- #%lib_set_version: The version of the content for the
lib_set_name.
- #%create_date: The date that the PGF file was created
- #%header0: Probeset header
- #%header1: Atom header
- #%header2: Probe header
When present, the clf_lib_set_name and clf_lib_set_version tags indicate
which CLF file is referenced by the PGF file. When absent, it is assumed
that the CLF file has the same lib_set_name and lib_set_version as the
PGF file.
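The header tags and the CLF fallback rule above might be handled as in the following Python sketch. The helper names read_pgf_tags and clf_reference are hypothetical. Apart from the chip type MoEx-1_0-st-v1, the tag values in the sample are illustrative only.

```python
from collections import defaultdict


def read_pgf_tags(lines):
    """Collect '#%key=value' tags; a key may occur more than once."""
    tags = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("##"):
            continue  # blank line or comment
        if not line.startswith("#%"):
            break     # first data line: all tags appear before it
        key, _, value = line[2:].partition("=")
        tags[key].append(value)
    return dict(tags)


def clf_reference(tags):
    """Return (lib_set_name, lib_set_version) of the referenced CLF file."""
    def first(key):
        values = tags.get(key)
        return values[0] if values else None

    # Fall back to the PGF's own values when the clf_* tags are absent.
    name = first("clf_lib_set_name") or first("lib_set_name")
    version = first("clf_lib_set_version") or first("lib_set_version")
    return name, version


# Illustrative header; the second chip_type and the lib_set values are made up.
sample = [
    "#%chip_type=MoEx-1_0-st-v1\n",
    "#%chip_type=MoEx-1_0-st-ta1\n",
    "#%lib_set_name=MoEx-1_0-st\n",
    "#%lib_set_version=r2\n",
    "1001\tmain\n",
]
tags = read_pgf_tags(sample)
print(tags["chip_type"])
print(clf_reference(tags))
```

Values are stored as lists because tags such as chip_type may be listed several times.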
Probeset types as reported in the PGF file:
What type of target does the probeset interrogate (a
single probeset may be associated with more than one):
- main: probeset is part of the main design
- control->affx: probeset is a standard AFFX control
- control->chip: probeset is a chip control
- control->bgp->antigenomic: probeset contains background
probes (antigenomic background probes)
- control->bgp->genomic: probeset contains background
probes (genomic background probes)
- normgene->exon: probeset is from an exonic region
of a normalization control gene
- normgene->intron: probeset is from an intronic region
of a normalization control gene
- rescue->FLmRNA->unmapped: probeset consists of probes
tiled across an mRNA transcript which either
didn't align to the genome, or aligned poorly
Probe (not probeset) types as reported in the PGF file:
What type of sample target is interrogated by the probeset:
- at: antisense target probe
- st: sense target probe
Perfect match or mismatch probe:
- pm: perfect match probe
- mm: mismatch probe
Internal Control Probe Types:
- blank
- generic
- jumbo-checkerboard
- thermo
- trigrid
The Affx::WTA::ParsePGF perl module should be used to interface with the PGF
files. This module is in the process of being rewritten in C++ and will
probably be added to the File Parser SDK sometime in the future.
README for Affymetrix Mouse Exon Array Sequence files.
Copyright 2005-2008, Affymetrix Inc.
All Rights Reserved
The content of Affymetrix array sequence files is covered by the
terms of use or license located at http://www.affymetrix.com/site/terms.affx
Array name: Mouse Exon 1.0 ST
Array/chip type: MoEx-1_0-st-v1
Part Numbers: 900831, 900819
Organism: Mus musculus
This README provides a guide to the contents of the array sequence
files for the Affymetrix Mouse Exon Array.
Mouse Exon Array support materials web site:
http://www.affymetrix.com/support/technical/byproduct.affx?product=moexon-st
Contents
--------
I. Probe Sequence Files
A. Probe fasta file
1. Description line attributes
2. Example entry
B. Probe tabular file
1. Column header line
2. Example entry
C. Array design categories
II. Probe Set Sequence File
A. Probe set fasta file
1. Description line attributes
2. Example entry
III. Transcript Cluster Sequence File
A. Transcript cluster fasta file
1. Description line attributes
2. Example entry
I. Probe Sequence Files
-----------------------
All probe sequences, in both fasta and tabular format, are provided in
the orientation they exist on the array and in the 5'->3'
direction. For a sense target (st) array such as the gene and exon
arrays, this corresponds to the reverse complement of the orientation
of the target mRNA sequence.
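To make the orientation concrete, here is a small Python sketch that converts a probe sequence back to the orientation of the target mRNA. The probe sequence is the one from the example entry in section I.A.2. The helper name reverse_complement is chosen for this example only.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


probe = "CACGGGAAGTCTGGGCTAAGAGACA"   # probe 494998 from section I.A.2
target = reverse_complement(probe)
print(target)  # TGTCTCTTAGCCCAGACTTCCCGTG
```

The result matches the first 25 bases of the example probe set sequence in section II.A.2, which is given in mRNA orientation.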
The probe sequence files include control probes in addition to
all probes from the main design.
I.A. Probe fasta file
The probe fasta file contains all probe sequences from the array in
fasta format. The identifier is composed of 'probe' followed by the
array type followed by the probe id followed by the x and y
position of the probe on the array, with each of these items
separated by a colon ':' character. The identifier is terminated by a
semicolon ';'.
I.A.1. Description line attributes
Additional attributes for each probe are included in the
description line in tag=value pairs. The following tags are
provided:
Attribute            Description
-------------------  ------------------------------------------
TranscriptClusterID  Transcript cluster identifier (integer)
Assembly             Genome assembly version from array design time
Seqname              Sequence name for genomic location of probe
Start                Starting coordinate of probe genomic location (1-based)
Stop                 Ending coordinate of probe genomic location (1-based)
Strand               Sequence strand of probe genomic location (+ or -)
Sense/Antisense      Strandedness of the target which the probe detects
category             Array design category of the probe (described below)
* Note: The sense/antisense field is the only one that is not
tag=value, but is either the string 'Sense' or 'Antisense'.
I.A.2. Example entry
Shown is an example fasta formatted probe sequence entry from the
human exon array.
>probe:HuEx-1_0-st-v2:494998;917:193; ProbeSetID=2315101; Assembly=build-34/hg16; Seqname=chr1; Start=1788; Stop=1812; Strand=+; Sense; category=main
CACGGGAAGTCTGGGCTAAGAGACA
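A description line like the one above could be split into its parts with a Python sketch such as the following. The function name parse_probe_header and the key target_strandedness are made up for this example. The layout of the identifier (probe id, then x and y after a semicolon) is read off the example entry rather than taken from a formal specification.

```python
def parse_probe_header(header):
    """Split a probe fasta description line into identifier parts and tags."""
    ident, _, rest = header.lstrip(">").partition(" ")
    # Identifier layout seen in the example: probe:<array type>:<id>;<x>:<y>;
    _, array_type, tail = ident.split(":", 2)
    probe_id, xy = tail.rstrip(";").split(";")
    x, y = xy.split(":")
    attrs = {}
    for field in rest.split(";"):
        field = field.strip()
        if not field:
            continue
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
        else:
            # The only field that is not tag=value: 'Sense' or 'Antisense'.
            attrs["target_strandedness"] = field
    return {
        "array_type": array_type,
        "probe_id": int(probe_id),
        "x": int(x),
        "y": int(y),
        "attrs": attrs,
    }


header = (
    ">probe:HuEx-1_0-st-v2:494998;917:193; ProbeSetID=2315101; "
    "Assembly=build-34/hg16; Seqname=chr1; Start=1788; Stop=1812; "
    "Strand=+; Sense; category=main"
)
info = parse_probe_header(header)
print(info["probe_id"], info["x"], info["y"], info["attrs"]["category"])
```

Tag values are kept as strings so that unexpected tags pass through unchanged.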
I.B. Probe tabular file
The probe tabular data file contains all probe sequences from the
array in tab-delimited format. Column headers are indicated in the
first line.
I.B.1. Column header line
Column Name          Description
-------------------  ------------------------------------------
Probe ID             Probe identifier (integer)
Probe Set ID         Probe set identifier (integer)
probe x              X coordinate for probe location on array
probe y              Y coordinate for probe location on array
assembly             Genome assembly version from array design time
seqname              Sequence name for genomic location of probe
start                Starting coordinate of probe genomic location (1-based)
stop                 Ending coordinate of probe genomic location (1-based)
strand               Sequence strand of probe genomic location (+ or -)
probe sequence       Probe sequence
target strandedness  Strandedness of the target which the probe detects
category             Array design category of the probe (described below)
I.B.2. Example entry
Shown is an example column header line and data line from the human
exon array.
Probe ID Probe Set ID probe x probe y assembly seqname start stop strand probe sequence target strandedness category
494998 2315101 917 193 build-34/hg16 chr1 1788 1812 + CACGGGAAGTCTGGGCTAAGAGACA Sense main
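The tabular file can be read with Python's csv module, as in the sketch below. The header and data line are the example above, re-typed with tabs between columns. The sketch also checks that the 1-based, inclusive start and stop coordinates agree with the probe length. The helper name genomic_length is hypothetical.

```python
import csv
import io


def genomic_length(row):
    """Length implied by 1-based, inclusive start/stop coordinates."""
    return int(row["stop"]) - int(row["start"]) + 1


# Example header and data line, with tabs restored between columns.
sample = (
    "Probe ID\tProbe Set ID\tprobe x\tprobe y\tassembly\tseqname\tstart\t"
    "stop\tstrand\tprobe sequence\ttarget strandedness\tcategory\n"
    "494998\t2315101\t917\t193\tbuild-34/hg16\tchr1\t1788\t1812\t+\t"
    "CACGGGAAGTCTGGGCTAAGAGACA\tSense\tmain\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
row = rows[0]
print(row["Probe ID"], genomic_length(row), len(row["probe sequence"]))
```

Because some column names contain spaces, splitting on tabs rather than on arbitrary whitespace is essential here.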
I.C. Array design categories
Both the probe fasta and tab files contain an indication of the
array design category of each probe. Here is a description of the
different types of categories.
Category                   Description
-------------------------  --------------------------------------
main                       part of the main design
control->affx              a standard AFFX control
control->chip              a chip control
control->bgp->antigenomic  antigenomic background probes
control->bgp->genomic      genomic background probes
normgene->exon             from an exonic region of a
                           normalization control gene
normgene->intron           from an intronic region of a
                           normalization control gene
rescue->FLmRNA->unmapped   probes were tiled across an mRNA
                           transcript which either did not align
                           to the genome, or aligned poorly
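Since the category names form a small hierarchy joined by '->', they can be split into levels for filtering. The following Python sketch shows one way to do this. The helper names split_category and is_background are made up for this example.

```python
from collections import Counter


def split_category(category):
    """Split a category such as 'control->bgp->genomic' into its levels."""
    return category.split("->")


def is_background(category):
    """True for the genomic and antigenomic background probe categories."""
    return split_category(category)[:2] == ["control", "bgp"]


categories = [
    "main",
    "control->affx",
    "control->bgp->antigenomic",
    "control->bgp->genomic",
    "normgene->exon",
]
# Count entries by their top-level category.
top_level = Counter(split_category(c)[0] for c in categories)
print(top_level["control"], [c for c in categories if is_background(c)])
```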
II. Probe Set Sequence File
-----------------------------
Probe set sequences consist of the contiguous genomic sequence
starting at the beginning of the first probe and ending at the end of
the last probe in the set as they are aligned to the genome. They are
provided in the orientation they exist in the mRNA in 5'->3'
direction.
Probe set sequences are extracted from the version of the
genome that was used for array design. During NetAffx annotation, the
array design may be lifted to a more current version of the genome
assembly. There could be differences between the design-time and
annotation-time versions of the probe set sequences. If you
require sequence data based on a version of the genome different from
the design-time assembly, contact Affymetrix support. The file
containing the probe set sequences is tagged with the name of
the genome assembly on which it is based.
The probe set sequence file includes the exon and intron normalization
control probe sets in addition to entries from the main design.
Categories of probe sets are the same as for probes, described above
in section I.C.
II.A. Probe set fasta file
The probe set fasta sequence file contains probe set sequences from
the array in fasta format. The identifier is composed of 'probe_set'
followed by the array type followed by the probe set ID, with each
of these items separated by a colon ':' character. The identifier is
terminated by a semicolon ';'.
II.A.1. Description line attributes
Additional attributes for each probe set are included in the
description line in tag=value pairs. The following tags are
provided:
Attribute  Description
---------  ------------------------------------------
Assembly   Genome assembly version from array annotation time
Seqname    Sequence name for genomic location of probe set
Start      Starting coordinate of probe set genomic location (1-based)
Stop       Ending coordinate of probe set genomic location (1-based)
Strand     Sequence strand of probe set genomic location (+ or -)
Length     Length of the probe set
category   Array design category of the probe set
II.A.2. Example entry
Shown is an example fasta formatted probe set sequence entry from
the human exon array.
>probe_set:HuEx-1_0-st-v2:2315101; Assembly=build-36/hg18; Seqname=chr1; Start=1788; Stop=2030; Strand=+; Length=243; category=main
TGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTC
TTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAG
GGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGG
ATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAG
GCA
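As a sanity check on an entry like this, the following Python sketch reads multi-line fasta records and compares the Length tag with the actual sequence length and with Stop - Start + 1. The reader read_fasta is a minimal hypothetical helper, not part of any Affymetrix tool.

```python
import re


def read_fasta(lines):
    """Yield (description, sequence) pairs from fasta-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def tag(header, name):
    """Return the integer value of a 'name=value' tag in a description line."""
    return int(re.search(rf"{name}=(\d+)", header).group(1))


entry = """\
>probe_set:HuEx-1_0-st-v2:2315101; Assembly=build-36/hg18; Seqname=chr1; Start=1788; Stop=2030; Strand=+; Length=243; category=main
TGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTC
TTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAG
GGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGG
ATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAG
GCA
"""
header, seq = next(read_fasta(entry.splitlines()))
print(len(seq), tag(header, "Length"), tag(header, "Stop") - tag(header, "Start") + 1)
```

For a probe set, which is a contiguous genomic region, all three numbers should agree.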
III. Transcript Cluster Sequence File
-------------------------------------
Transcript cluster sequences are created by removing all intronic
regions from the transcript cluster sequence and splicing
together all constituent exons into a single sequence for each
transcript cluster. They are provided in the orientation they exist in
the mRNA in 5'->3' direction.
The length given for the transcript cluster sequence is the total
length of all spliced exon clusters within the transcript cluster,
removing any intronic sequence. This will differ from the length
obtained by taking the difference between the start and stop
genomic coordinates for the transcript cluster, which would include
any intronic sequence.
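The difference between the two lengths can be illustrated with a short Python sketch. The exon coordinates below are hypothetical (1-based, inclusive, non-overlapping) and are not taken from any real transcript cluster.

```python
def spliced_length(exons):
    """Total length of the exon clusters, with intronic sequence removed."""
    return sum(stop - start + 1 for start, stop in exons)


def genomic_span(exons):
    """Length from the first start to the last stop, introns included."""
    first = min(start for start, _ in exons)
    last = max(stop for _, stop in exons)
    return last - first + 1


# Hypothetical exon clusters as (start, stop), 1-based and inclusive.
exons = [(1000, 1099), (1500, 1649), (2000, 2049)]
print(spliced_length(exons), genomic_span(exons))  # 300 1050
```

The Length attribute corresponds to the first number. The second number is what the Start and Stop coordinates alone would suggest.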
Transcript cluster sequences are extracted from the version of the
genome that was used for array design. During NetAffx annotation, the
array design may be lifted to a more current version of the genome
assembly. There could be differences between the design-time and
annotation-time versions of the transcript cluster sequences. If you
require sequence data based on a version of the genome different from
the design-time assembly, contact Affymetrix support. The file
containing the transcript cluster sequences is tagged with the name of
the genome assembly on which it is based.
The transcript cluster sequence file does not include any
controls, only entries from the main design.
III.A. Transcript cluster fasta file
The transcript cluster fasta sequence file contains transcript
cluster sequences from the array in fasta format. The identifier is
composed of 'transcript_cluster' followed by the array type
followed by the transcript cluster ID, with each of these items
separated by a colon ':' character. The identifier is terminated by a
semicolon ';'.
III.A.1. Description line attributes
Additional attributes for each transcript cluster are included in
the description line in tag=value pairs. The following tags are
provided:
Attribute  Description
---------  ------------------------------------------
Assembly   Genome assembly version from array annotation time
Seqname    Sequence name for genomic location of TC
Start      Starting coordinate of TC genomic location (1-based)
Stop       Ending coordinate of TC genomic location (1-based)
Strand     Sequence strand of TC genomic location (+ or -)
Length     Length of concatenated exon clusters in the TC
III.A.2. Example entry
Shown is an example fasta formatted transcript cluster sequence
entry from the human exon array.
>transcript_cluster:HuEx-1_0-st-v2:2315109; Assembly=build-36/hg18; Seqname=chr1; Start=9271; Stop=9575; Strand=+; Length=305;
GGAAAGGGAGGGGGAGGATGTGGGATGGTGGAGGGGCTGCAGACTCTGGGCTAGGGAAAG
CTGGGATGTCTCTAAAGGTTGGAATGAATGGCCTAGAATCCGACCCAATAAGCCAAAGCC
ACTTCCACCAACGTTAGAAGGCCTTGGCCCCCAGAGAGCCAATTTCACAATCCAGAAGTC
CCCGTGCCCTAAAGGGTCTGCCCTGATTACTCCTGGCTCCTTGTGTGCAGGGGGCTCAGG
CATGGCAGGGCTGGGAGTACCAGCAGGCACTCAAGCGGCTTAAGTGTTCCATGACAGACT
GGTAT