Skip to content

Seqnature: incorporate SNPs and Indels into a reference genome

Notifications You must be signed in to change notification settings

TabakoffLab/Seqnature

 
 

Repository files navigation

#Seqnature

If you use Seqnature for your published research, please cite:

Munger SC, Raghupathy N, Choi KB, Simons AK, Gatti DM, Hinerfeld DA, Svenson KL, Keller MP, Attie AD, Hibbs MA, Graber JH, Chesler EJ, and Churchill GA. RNA-seq alignment to individualized genomes improves transcript abundance estimates in multiparent populations. GENETICS, In Press.

The programs' author is Al Simons (al.simons at jax.org). Please contact him to report any problems with the programs.

Seqnature is a set of programs to create custom sequences from a reference sequence and a pair of VCF files containing SNP and Indel information.
It was created for, and most heavily tested on, mouse.

If you have gtf files with annotations based on the same reference, the programs can create an updated gtf, reflecting the Indels.

##Creating new sequence and annotations for strains This section discusses the programs used to create strain-specific references and annotation files, build_new_sequence_from_vcfs.py and adjust_annotations.py. For information about building individualized reference sequences and annotation files, please see the section "Creating new sequence and annotations for individuals", below.

The program build_new_sequence_from_vcfs.py takes a reference sequence, and two vcf files, one each of SNPs and Indels. It produces a new sequence file and a set of bookkeeping files that are used by the second program in the set, adjust_annotations.py. The following options are required:

  • --indels
  • --reference
  • --snps

The program adjust_annotations.py takes a gtf file with coordinates matching the reference sequence previously processed, and the bookkeeping files produced by build_new_sequence_from_vcfs.py. It produces an updated gtf file reflecting the effects of the Indels. While most commonly used on gtf files, it is able to process any tab separated file with columns containing chromosome and positions within the chromosome that are to be updated.

The Seqnature programs expect chromosome names of the form "1", "2", "X", etc, not "chr1", "chr2", etc.

###Using build_new_sequence_from_vcfs.py

$ ./build_new_sequence_from_vcfs.py --help
Usage: build_new_sequence_from_vcfs.py [options] strain
    Strain must be a column header in the two VCF files.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -d DIR, --output-dir=DIR
                        Directory for output files (must already exist).
                        (Default: current working directory.)
  -i INDELS, --indels=INDELS
                        path to indels vcf file
  -o OUTPUT, --output=OUTPUT
                        File into which the genome is written. (Default:
                        stdout)
  -p, --pass-only       Only use SNPs and Indels that passed quality
                        filters
  -r REFERENCE, --reference=REFERENCE
                        Path to the reference file used as the base.
  -s SNPS, --snps=SNPS  Path to the snps vcf file

###Using adjust_annotations.py

$ ./adjust_annotations.py --help
Usage: adjust_annotations.py [options] annotation-file strain-identifier

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -c CHRCOL, --chr-column=CHRCOL
                        Column with chromosome info. (Required, one-based.)
  -C COMMENTS, --comments-file=COMMENTS
                        File to write adjustment comments to. (Default: no
                        comments file.)
  -d DIR, --directory=DIR
                        path to directory containing offset files (default:
                        current directory)
  -e ENDCOL, --end-column=ENDCOL
                        The column containing the position of the end of the
                        feature. (Required, one-based.)
  -n NEAR, --near=NEAR  Report if a feature position is within NEAR bases of
                        an indel. (Optional, default is no report.)
  -o OFN, --output-file=OFN
                        Output filename (Optional, default: stdout)
  -p POSCOLS, --position-columns=POSCOLS
                        Comma-separated list of other columns with positions
                        to be adjusted. (Optional, one-based.)
  -q, --quiet           Operate quietly. Do not report chromosomes and contigs
                        not in the Indel vcf file.
  -s STARTCOL, --start-column=STARTCOL
                        The column containing the position of the end of the
                        feature. (Required, one-based.)
  -t TYPECOL, --type-column=TYPECOL
                        Column with feature type info. (Optional, one-based.)

##Creating new sequence and annotations for individuals This section describes the programs used to create new reference sequences and annotation files for individual animals which have been haplotyped: reconstruct_NCBI.py, build_individualized_genome.py and adjust_individualized_annotations.py.

An example shell script for running all of these steps other than reordering the reference, is included as do_DO.sh.

###Reordering the reference for performance The program reconstruct_NCBI.py reorders the original genome reference into chromosome order, and splits out all the NT_ unplaced extents and chromosomes Y and MT. This allows marked improvements in the performance of the program that builds the individualized genome. This program only has to be run once, not once per animal.

###Building the individualized genome reference After the reference has been reordered, the program build_individualized_genome.py will create one chromatid's reference for the individual. This program is run twice, once per haplotype file ('L' and 'R'). At the end of the 'R' run, the program includes the unplaced NT_ contigs and chr MT from the original reference. If the sample is male, the unmodified chromosome Y is appended as well. After running the program twice, the output files are concatenated to form the complete diploid reference.

###Adjusting the annotations file Once the genome has been created by two invocations of build_individualized_genome.py, the annotation gtf file can be adjusted by program adjust_individualized_annotations.py. Just as with build_individualized_genome.py, adjust_individualized_annotations.py is run twice, once per haplotype block, and the output from the two runs are concatenated.

###Using reconstruct_NCBI.py Running the program without any arguments produces the following usage information:

$ ./reconstruct_NCBI.py

USAGE: reconstruct_NCBI.py reference-file.fa

Output is written to three files in the current working directory.
The file names are based on the input .fa name.  Assume the
input file is named NCBIM37.fa.  The output files would be
   - NCBIM37_ordered.fa
   - NCBIM37_contigs.fa
   - NCBIM37_Y.fa

The program for which this is a preprocessor, build_individualized_genome.py,
takes in the basename, optionally with a path.  In our example, this 
would be NCBIM37 or /some/path/to/NCBIM37

About chromosome names: This program assumes NCBI-style chromosome
names "1", "2", "X", etc., not mm9-style names, e.g., "chr1".

###Using build_individualized_genome.py ./build_individualized_genome.py --help Usage: build_individualized_genome.py [options] animal-id L-R-designation

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -c CONTROL, --control=CONTROL
                        path to control file
  -d DIR, --output-dir=DIR
                        Directory for output files (must already exist).
                        (Default: current working directory.)
  -i INDELS, --indels=INDELS
                        path to indels vcf file
  -o OUTPUT, --output-sequence=OUTPUT
                        File (fasta) into which to write the sequence.
                        (Default: stdout)
  -p, --pass-only       Only use SNPs and Indels that passed quality
                        filters
  -r REFERENCE, --reference=REFERENCE
                        Base name of the reference file that has been
                        preprocessed by reconstruct_NCBI.py, e.g.,
                        /some/path/to/NCBIM37; not /path/NCBIM37_ordered.fa
  -s SNPS, --snps=SNPS  Path to the snps vcf file

###Using adjust_individualized_annotations.py ./adjust_individualized_annotations.py --help Usage: adjust_individualized_annotations.py [options] annotation-file DO-identifier L-or-R

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -c CHRCOL, --chr-column=CHRCOL
                        Column with chromosome info. (Required, one-based.)
  -C COMMENTS, --comments-file=COMMENTS
                        File to write adjustment comments to. (Default: no
                        comments file.)
  -d DIR, --dir=DIR     path to directory containing offset files (default:
                        current directory)
  -e ENDCOL, --end-column=ENDCOL
                        The column containing the position of the end of the
                        feature. (Required, one-based.)
  -n NEAR, --near=NEAR  Report if a feature position is within near bases of
                        an indel. (Optional, default is no report.)
  -o OFN, --output-file=OFN
                        Output filename (Optional, default: stdout)
  -p POSCOLS, --position-columns=POSCOLS
                        Comma-separated list of other columns with positions
                        to be adjusted. (Optional, one-based.)
  -s STARTCOL, --start-column=STARTCOL
                        The column containing the position of the end of the
                        feature. (Required, one-based.)
  -t TYPECOL, --type-column=TYPECOL
                        Column with feature type info. (Optional, one-based.)
  -x EXTENTS, --extents=EXTENTS
                        File containing the strain extents.

##License

This software is licensed under GPL V3 or later.

About

Seqnature: incorporate SNPs and Indels into a reference genome

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 98.4%
  • Shell 1.6%