diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000..f376dca Binary files /dev/null and b/.DS_Store differ diff --git a/README.md b/README.md index 3feb741..0d6ec02 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,14 @@ -#HUPAN +# HUPAN: HUman Pan-genome ANalysis --- **1. Introduction** -The human reference genome is still incomplete, especially for those population-specific or individual-specific regions, which may have important functions. It encourages us to build the pan-genome of human population. Previously, our team developed a "map-to-pan" strategy, [EUPAN][1], specific for eukaryotic pan-genome analysis. However, due to the large genome size of individual human genome, [EUPAN][2] is not suit for pan-genome analysis involving in hundreds of individual genomes. Here, we present an improved tool, HUPAN (Human Pan-genome Analysis), for human pan-genome analysis. +The human reference genome is still incomplete, especially for population-specific or individual-specific regions, which may have important functions. This motivates building a pan-genome of the human population. Previously, our team developed a "map-to-pan" strategy, [EUPAN][1], for eukaryotic pan-genome analysis. However, because individual human genomes are large, [EUPAN][2] is not suitable for pan-genome analyses involving hundreds of individual genomes. Here, we present an improved tool, HUPAN (HUman Pan-genome ANalysis), for human pan-genome analysis. + +The HUPAN homepage is http://cgm.sjtu.edu.cn/hupan/ + +The HUPAN paper is available at https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1751-y **2. Installation** @@ -15,9 +19,11 @@ The human reference genome is still incomplete, especially for those population- R is utilized for visualization and statistical tests in HUPAN toolbox. Please install R first and make sure R and Rscript are under your PATH. 
- - R packages Several R packages are needed including ggplot2, reshape2 - and ape packages. Follow the Installation step, - - or you can install the packages by yourself. + + - R packages + + Several R packages are needed including ggplot2, reshape2 + and ape packages. Follow the Installation step, or you can install the packages by yourself. **Installation procedures** @@ -26,9 +32,7 @@ The human reference genome is still incomplete, especially for those population- `git clone git@github.com:SJTU-CGM/HUPAN.git` - Alternatively, you also could obtain the toolbox in the [HUPAN][4] - website; - - - Please uncompress the HUPAN toolbox package: + website and uncompress the HUPAN toolbox package: `tar zxvf HUPAN-v**.tar.gz` @@ -133,7 +137,8 @@ ii. If the reads are not so good, the users could trim or filter low-quality rea hupanSLURM trim -w 100 -m 100 data/ filter/ /path/to/Trimmomatic Results could be found in the trim or filter directory. -iii.After trimming or filtration of reads, the sequencing quality should be evaluated again by `qualitySta`, and if the trimming results are still not good for subsequent analyses, new parameters should be given and the above steps should be conducted for several times. + +iii. After trimming or filtration of reads, the sequencing quality should be evaluated again by `qualitySta`, and if the trimming results are still not good for subsequent analyses, new parameters should be given and the above steps should be conducted for several times. **(3) *De novo* assembly of individual genomes** @@ -147,7 +152,7 @@ Please note that this startegy requires huge memory for assembly an individual h ii.Assembly by the iterative use of SOAPDenovo2. Not Recommend. - hupanSLURM linearK data assembly_linearK/ /path/to/SOAPDenovo2 + hupanSLURM assemble linearK data assembly_linearK/ /path/to/SOAPDenovo2 iii. Assembly by [SGA][11]. @@ -176,10 +181,15 @@ iv. Two types of non-reference sequences, fully unaligned sequences and partiall v. 
Non-reference sequences from multiple individuals are merged: hupanSLURM mergeUnalnCtg Unalign_result/data/ mergeUnalnCtg_result + + Alternatively, if you ran step iv with `hupan`, the merged result is already in Unalign_result/total: + + mv Unalign_result/total/ mergeUnalnCtg_result + + +**(5) Remove redundancy and potential contamination sequences** -**(5) Remove redundancy and potential commination sequences** - -After obtaining the non-reference sequences from multiple individuals, redundant sequences between different individuals should be excluded, and the potential commination sequences from non-human species are also removed for further analysis. +After obtaining the non-reference sequences from multiple individuals, redundant sequences between different individuals should be excluded, and potential contamination sequences from non-human species should also be removed before further analysis. i. The step of remove redundancy sequences is conducted by [CDHIT][14] for fully unaligned sequences and partially unaligned sequences, respectively: @@ -188,17 +198,24 @@ i. The step of remove redundancy sequences is conducted by [CDHIT][14] for fully ii. Then the non-redundant sequences are aligned to NCBI’s non-redundant nucleotide database by [BLAST][15]: - hupanSLURM blastAlign blast rmRedundant rmRedundant_blast /path/to/nt /path/to/blast + mkdir nt && cd nt + wget https://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz && gunzip nt.gz && cd .. + hupanSLURM blastAlign mkblastdb nt nt_index /path/to/blast + mkdir rmRedundant && mv rmRedundant.fully.unaligned rmRedundant && mv rmRedundant.partially.unaligned rmRedundant + hupanSLURM blastAlign blast rmRedundant rmRedundant_blast /path/to/nt_index /path/to/blast iii. 
According to the alignment result, the taxonomic classification of each sequences (if have) could be obtained: - hupanSLURM getTaxClass rmRedundant_blast/ data/fully/fully.non-redundant.blast info/ TaxClass_fully - hupanSLURM getTaxClass rmRedundant_blast/ data/partially/partially.non-redundant.blast info/ TaxClass_partially + wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid + wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz && tar -zvxf new_taxdump.tar.gz + mkdir info && mv nucl_gb.accession2taxid info && mv new_taxdump/rankedlineage.dmp info + hupanSLURM getTaxClass rmRedundant_blast/data/rmRedundant.fully.unaligned/non-redundant.blast info/ TaxClass_fully + hupanSLURM getTaxClass rmRedundant_blast/data/rmRedundant.partially.unaligned/non-redundant.blast info/ TaxClass_partially iv. And the sequences classifying as microbiology and non-primate eukaryotes are considered as non-human sequences and removed from further consideration: - hupanSLURM rmCtm -i 60 rmRedundant/fully/fully.non-redundant.fa rmRedundant_blast/data/fully/fully.non-redundant.blast TaxClass_fully/data/accession.name rmCtm_fully - hupanSLURM rmCtm -i 60 rmRedundant/partially/partially.non-redundant.fa rmRedundant_blast/data/partially/partially.non-redundant.blast TaxClass_partially/data/accession.name rmCtm_partially + hupanSLURM rmCtm -i 60 rmRedundant/rmRedundant.fully.unaligned/non-redundant.fa rmRedundant_blast/data/rmRedundant.fully.unaligned/non-redundant.blast TaxClass_fully/data/accession.name rmCtm_fully + hupanSLURM rmCtm -i 60 rmRedundant/rmRedundant.partially.unaligned/non-redundant.fa rmRedundant_blast/data/rmRedundant.partially.unaligned/non-redundant.blast TaxClass_partially/data/accession.name rmCtm_partially **(6) Construction and annotation of pan-genome** @@ -219,13 +236,12 @@ iii. Then after all procedures are finished, the outcomes are merged: iv. 
The new predicted genes may be highly similar to the genes that are located in reference genome, and additional filtering step should be conducted to ensure the novelty of predicted gene: - hupanSLURM filterNovGen GenePre_merge GenePre_filter /path/to/reference/ /path/to/blast /path/to/cdhit /path/to/RepeatMask + hupanSLURM filterNovGene GenePre_merge GenePre_filter /path/to/reference/ /path/to/blast /path/to/cdhit /path/to/RepeatMask v. The annotation of pan-genome sequences is simply merged to obtain by combine two annotation files: - - hupanSLURM pTpG ref/ref.gtf ref/ref-ptpg.gtf - cat ref/ref-ptpg.gtf non-reference.gtf >pan/pan.gtf + hupanSLURM pTpG ref/ref.gff ref/ref-ptpg.gff + cat ref/ref-ptpg.gff non-reference.gtf >pan/pan.gff **(7) PAV analysis** @@ -243,7 +259,7 @@ ii. The result of .sam should be converted to .bam and sorted and indexed use [S iii. Then the gene body coverage and the cds coverage of each gene are calculated: - hupanSLURM geneCov panBam/data geneCov/ pan/pan.gtf + hupanSLURM geneCov panBam/data geneCov/ pan/pan.gff iv. Finally, the gene presence-absence is determined by the threshold of cds coverage as 95%: @@ -263,7 +279,7 @@ Any bugs or suggestions, please contact the [authors][20]. 
[3]: https://github.com/SJTU-CGM/HUPAN [4]: http://cgm.sjtu.edu.cn/hupan/download.php [5]: http://cgm.sjtu.edu.cn/eupan/ - [6]: ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/HG001.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam + [6]: http://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/HG001.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam [7]: http://cgm.sjtu.edu.cn/hupan/data/hupanExample.tar.gz [8]: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [9]: http://www.usadellab.org/cms/index.php?page=trimmomatic diff --git a/hupan_cmd.sh b/hupan_cmd.sh index d68b885..714600b 100644 --- a/hupan_cmd.sh +++ b/hupan_cmd.sh @@ -1,10 +1,9 @@ #!/bin/bash IFS=' ' -complete -W "qualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta getUnalnCtg rmRedundant pTpG geneCov geneExist subSample gFamExist bam2bed fastaSta sim rmCtm blastAlign simSeq splitSeq getTaxClass genePre mergeNovGene" eupan +complete -W "qualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta getUnalnCtg rmRedundant pTpG geneCov geneExist subSample gFamExist bam2bed fastaSta sim getTaxClass rmCtm blastAlign simSeq splitSeq genePre mergeNovGene filterNovGene" hupan -complete -W "qualSta mergeQualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta mergeAssemSta getUnalnCtg mergeUnalnCtg rmRedundant pTpG geneCov mergeGeneCov geneExist subSample gFamExist bam2bed fastaSta sim rmCtm blastAlign simSeq splitSeq getTaxClass genePre mergeNovGene" eupanLSF - -complete -W "qualSta mergeQualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta mergeAssemSta getUnalnCtg mergeUnalnCtg rmRedundant pTpG geneCov mergeGeneCov geneExist subSample gFamExist bam2bed fastaSta sim rmCtm blastAlign simSeq splitSeq getTaxClass genePre mergeNovGene" eupanSLURM +complete -W "qualSta 
mergeQualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta mergeAssemSta getUnalnCtg mergeUnalnCtg rmRedundant pTpG geneCov mergeGeneCov geneExist subSample gFamExist bam2bed fastaSta sim getTaxClass rmCtm blastAlign simSeq splitSeq genePre mergeNovGene filterNovGene" hupanLSF +complete -W "qualSta mergeQualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta mergeAssemSta getUnalnCtg mergeUnalnCtg rmRedundant pTpG geneCov mergeGeneCov geneExist subSample gFamExist bam2bed fastaSta sim getTaxClass rmCtm blastAlign simSeq splitSeq genePre mergeNovGene filterNovGene" hupanSLURM diff --git a/lib/HUPANassem.pm b/lib/HUPANassem.pm index 040f6ea..4de5a2b 100644 --- a/lib/HUPANassem.pm +++ b/lib/HUPANassem.pm @@ -2,6 +2,8 @@ #Created by Hu Zhiqiang, 2014-9-5 #Modefied by Duan Zhongqu, 2018-7-2 #Added the function of de novo assembly by sga with lower memory. +#Modefied by Duan Zhongqu, 2020-4-25 +#Fix some bugs, for example, the suffix of sequencing data files. package assembly; @@ -19,7 +21,7 @@ Commands: if($com eq "soapdenovo"){ soap(@ARGV); } - elsif($com eq "linearK.pl"){ + elsif($com eq "linearK"){ linearK(@ARGV); } elsif($com eq "sga"){ @@ -45,29 +47,29 @@ hupan assemble soapdenovo is used to assemble high-quality reads on large scale. Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by .fastq or .fastq.gz. + sequencing files ended by \".fq.gz\" or \".fastq.gz\". output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. 
- sopadenovo_directory directory where soapdenovo2 executable files exists + sopadenovo_directory Directory where soapdenovo2 executable files exists Options: - -h Print this usage page. + -h Print this usage page. - -t Threads used. - Default: 1 + -t Threads used. + Default: 1 -s Suffix of files within data_directory. - Default: .fq.gz + Default: \".fastq.gz\" - -k Kmer. - Default: 35 + -k Kmer. + Default: 35 -c Parameters of soapdenovo2 config file. 8 parameters ligated by comma 1)maximal read length @@ -82,7 +84,7 @@ Options: (at least 32 for short insert size) Default: 80,460,0,3,80,1,3,32 - -g enable gapcloser + -g Enable gapcloser "; die $usage if @ARGV!=3; @@ -91,10 +93,7 @@ my ($data_dir,$out_dir,$tool_dir)=@ARGV; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } $tool_dir.="/" unless $tool_dir=~/\/$/; @@ -116,7 +115,7 @@ my $kmer=35; $kmer=$opt_k if defined $opt_k; #define file suffix -my $suffix=".fq.gz"; +my $suffix=".fastq.gz"; $suffix=$opt_s if defined($opt_s); my ($max_rd_len,$avg_ins,$reverse_seq,$asm_flags,$rd_len_cutoff,$rank,$pair_num_cutoff,$map_len) @@ -262,29 +261,29 @@ hupan assmble linearK is used to assemble high-quality reads on large scale. Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by .fastq or .fastq.gz. + sequencing files ended by \".fq.gz\" or \".fastq.gz\". output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. 
We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - sopadenovo_directory directory where soapdenovo2 executable files exists + sopadenovo_directory Directory where soapdenovo2 executable files exists Options: - -h Print this usage page. + -h Print this usage page. - -t Threads used. - Default: 1 + -t Threads used. + Default: 1 -g Genome size. Used to infer sequencing depth. - Default: 380000000 (460M) + Default: 3000000000 (3G) -s Suffix of files within data_directory. - Default: .fq.gz + Default: \".fastq.gz\" -r Parameters of linear function: Kmer=2*int(0.5*(a*Depth+b))+1. The parameter should be input as \"a,b\". @@ -328,10 +327,7 @@ my ($data_dir,$out_dir,$soapdenovo)=@ARGV; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist -."); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Check executable linearK.pl my $exec="linearK.pl"; @@ -396,8 +392,8 @@ use strict; use warnings; use Cwd 'abs_path'; use Getopt::Std; -use vars qw($opt_h $opt_t $opt_d $opt_m); -getopts("ht:d:m"); +use vars qw($opt_h $opt_t $opt_d $opt_m $opt_s); +getopts("ht:d:m:s:"); my $usage="\nUsage: hupan assemble sga [options] @@ -406,12 +402,12 @@ hupan assemble sga is used to assemble high-quality reads on large scale. Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by fastq.gz. + sequencing files ended by \".fastq.gz\" or \".fq.gz\". 
output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -424,8 +420,11 @@ Options: -t Threads used. Default: 1. + -s Suffix of files within data_directory. + Default: \".fastq.gz\" + -d The parameter sets used for different depths of sequencing - data. We obtained two optimatial paramter sets from + data, we obtained two optimatial paramter sets from simulated data for 30-fold and 60-fold, respectively. Default: 30. @@ -441,16 +440,13 @@ my ($data_dir,$out_dir,$sga_dir)=@ARGV; $data_dir=abs_path($data_dir); $out_dir=abs_path($out_dir); -print $data_dir."\n"; -print $out_dir."\n"; +#print $data_dir."\n"; +#print $out_dir."\n"; my $cmd_dir=$ENV{'PWD'}; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. 
We kindly request that the - output directory should not exist -."); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Check executable sga @@ -464,6 +460,10 @@ if(defined($opt_t)){ $thread_num=$opt_t; } +#define file suffix +my $suffix=".fastq.gz"; +$suffix=$opt_s if defined($opt_s); + #define depth my $depth=30; $depth=$opt_d if defined $opt_d; @@ -499,9 +499,9 @@ foreach my $s (@sample){ foreach my $f (@files){ next if $f=~/^\.+$/; next if $f=~/^single/; - print STDERR "Warnig: $f without suffix: fastq.gz\n" unless $f=~/fastq.gz$/; - next unless $f=~/fastq.gz$/; - my $fb=substr($f,0,length($f)-length(".fastq.gz")-1); + print STDERR "Warning: $f without suffix: $suffix\n" unless $f=~/\Q$suffix\E$/; + next unless $f=~/\Q$suffix\E$/; + my $fb=substr($f,0,length($f)-length($suffix)-1); #print $fb."\n"; push @fq_base, $fb; } @@ -517,8 +517,10 @@ foreach my $s (@sample){ if($length==1){ my $b=$fq_base[0]; $preprocess_fastq=$o_dir.$s.".fastq"; - $fastq1=$s_dir.$b."1.fastq.gz"; - $fastq2=$s_dir.$b."2.fastq.gz"; + $fastq1=$s_dir.$b."1".$suffix; + $fastq2=$s_dir.$b."2".$suffix; + die("Error: missed file: $fastq1\n") unless -e $fastq1; + die("Error: missed file: $fastq2\n") unless -e $fastq2; $com="$sga preprocess -o $preprocess_fastq --pe-mode 1 $fastq1 $fastq2\n"; $com.="$sga index -a ropebwt --no-reverse -t $thread_num $preprocess_fastq\n"; } @@ -526,10 +528,10 @@ foreach my $s (@sample){ my @list; my $merge_prefix=$o_dir.$s; foreach my $b (@fq_base){ - $fastq1=$s_dir.$b."1.fastq.gz"; - $fastq2=$s_dir.$b."2.fastq.gz"; - print STDERR "Warning: missed file: $fastq1\n" unless -e $fastq1; - print STDERR "Warning: missed file: $fastq2\n" unless -e $fastq2; + $fastq1=$s_dir.$b."1".$suffix; + $fastq2=$s_dir.$b."2".$suffix; + die("Error: missed file: $fastq1\n") unless -e $fastq1; + die("Error: missed file: $fastq2\n") unless -e $fastq2; 
$preprocess_fastq=$s_dir.$s.".fastq"; $com="$sga preprocess -o $preprocess_fastq --pe-mode 1 $fastq1 $fastq2\n"; $com.="$sga index -a ropebwt --no-reverse -t $thread_num $preprocess_fastq\n"; diff --git a/lib/HUPANassemLSF.pm b/lib/HUPANassemLSF.pm index fe47723..3b161b7 100644 --- a/lib/HUPANassemLSF.pm +++ b/lib/HUPANassemLSF.pm @@ -2,6 +2,8 @@ #Created by Hu Zhiqiang, 2014-9-5 #Modefied by Duan Zhongqu, 2018-7-2 #Added the function of de novo assembly by sga with lower memory. +#Modefied by Duan Zhongqu, 2020-4-25 +#Fix some bugs, for example, the suffix of sequencing data files. package assembly; @@ -19,7 +21,7 @@ Commands: if($com eq "soapdenovo"){ soap(@ARGV); } - elsif($com eq "linearK.pl"){ + elsif($com eq "linearK"){ linearK(@ARGV); } elsif($com eq "sga"){ @@ -45,29 +47,29 @@ hupanLSF assemble soapdenovo is used to assemble high-quality reads on large sca Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by .fastq or .fastq.gz. + sequencing files ended by \".fq.gz\" or \".fastq.gz\". output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - sopadenovo_directory directory where soapdenovo2 executable files exists + sopadenovo_directory Directory where soapdenovo2 executable files exists Options: - -h Print this usage page. + -h Print this usage page. - -t Threads used. - Default: 1 + -t Threads used. + Default: 1 -s Suffix of files within data_directory. - Default: .fq.gz + Default: .fastq.gz - -k Kmer. - Default: 35 + -k Kmer. + Default: 35 -c Parameters of soapdenovo2 config file. 
8 parameters ligated by comma 1)maximal read length @@ -82,7 +84,7 @@ Options: (at least 32 for short insert size) Default: 80,460,0,3,80,1,3,32 - -g enable gapcloser + -g Enable gapcloser -q The queue name for job submiting. Default: default queue @@ -94,10 +96,7 @@ my ($data_dir,$out_dir,$tool_dir)=@ARGV; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } $tool_dir.="/" unless $tool_dir=~/\/$/; @@ -119,7 +118,7 @@ my $kmer=35; $kmer=$opt_k if defined $opt_k; #define file suffix -my $suffix=".fq.gz"; +my $suffix=".fastq.gz"; $suffix=$opt_s if defined($opt_s); my ($max_rd_len,$avg_ins,$reverse_seq,$asm_flags,$rd_len_cutoff,$rank,$pair_num_cutoff,$map_len) @@ -241,29 +240,29 @@ hupanLSF assmble linearK is used to assemble high-quality reads on large scale. Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by .fastq or .fastq.gz. + sequencing files ended by \".fq.gz\" or \".fastq.gz\". output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - sopadenovo_directory directory where soapdenovo2 executable files exists + sopadenovo_directory Directory where soapdenovo2 executable files exists Options: - -h Print this usage page. + -h Print this usage page. 
- -t Threads used. - Default: 1 + -t Threads used. + Default: 1 -g Genome size. Used to infer sequencing depth. - Default: 380000000 (460M) + Default: 3000000000 (3G) -s Suffix of files within data_directory. - Default: .fq.gz + Default: .fastq.gz -r Parameters of linear function: Kmer=2*int(0.5*(a*Depth+b))+1. The parameter should be input as \"a,b\". @@ -299,6 +298,7 @@ Options: -m The number of consecutive Ns to be broken down to contigs.This is used in the process break gapclosed scaffolds to contigs. Default: 10. + -q The queue name for job submiting. Default: default queue "; @@ -309,10 +309,7 @@ my ($data_dir,$out_dir,$soapdenovo)=@ARGV; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist -."); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Check executable linearK.pl my $exec="linearK.pl"; @@ -405,8 +402,8 @@ use strict; use warnings; use Cwd 'abs_path'; use Getopt::Std; -use vars qw($opt_h $opt_t $opt_d $opt_q $opt_m); -getopts("ht:d:q:m:"); +use vars qw($opt_h $opt_t $opt_d $opt_q $opt_m $opt_s); +getopts("ht:d:q:m:s:"); my $usage="\nUsage: hupanLSF assemble sga [options] @@ -415,12 +412,12 @@ hupanLSF assemble sga is used to assemble high-quality reads on large scale. Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by fastq.gz. + sequencing files ended by \".fastq.gz\" or \".fq.gz\". output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. 
We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -428,18 +425,21 @@ Necessary input description: sga_directory Directory where sga executable files exists Options: - -h Print this usage page. + -h Print this usage page. + + -t Threads used. + Default: 1. - -t Threads used. - Default: 1. + -s Suffix of files within data_directory. + Default: \".fastq.gz\" - -d The parameter sets used for different depths of sequencing - data. We obtained two optimatial paramter sets from - simulated data for 30-fold and 60-fold, respectively. - Default: 30. + -d The parameter sets used for different depths of sequencing + data, we obtained two optimatial paramter sets from + simulated data for 30-fold and 60-fold, respectively. + Default: 30. - -q The queue name for job submiting. - Default: default queue + -q The queue name for job submiting. + Default: default queue -m The intermediate results are huge and we kindly suggested delete them after finishing each steps. If you want to keep @@ -452,16 +452,13 @@ my ($data_dir,$out_dir,$sga_dir)=@ARGV; $data_dir=abs_path($data_dir); $out_dir=abs_path($out_dir); -print $data_dir."\n"; -print $out_dir."\n"; +#print $data_dir."\n"; +#print $out_dir."\n"; my $cmd_dir=$ENV{'PWD'}; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist -."); + die("Error: output directory \"$out_dir\" already exists. 
 To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Check executable sga @@ -475,6 +472,10 @@ if(defined($opt_t)){ $thread_num=$opt_t; } +#define file suffix +my $suffix=".fastq.gz"; +$suffix=$opt_s if defined($opt_s); + #define depth my $depth=30; $depth=$opt_d if defined $opt_d; @@ -510,9 +511,9 @@ foreach my $s (@sample){ foreach my $f (@files){ next if $f=~/^\.+$/; next if $f=~/^single/; - print STDERR "Warnig: $f without suffix: fastq.gz\n" unless $f=~/fastq.gz$/; - next unless $f=~/fastq.gz$/; - my $fb=substr($f,0,length($f)-length(".fastq.gz")-1); + print STDERR "Warning: $f without suffix: $suffix\n" unless $f=~/\Q$suffix\E$/; + next unless $f=~/\Q$suffix\E$/; + my $fb=substr($f,0,length($f)-length($suffix)-1); #print $fb."\n"; push @fq_base, $fb; } @@ -528,8 +529,10 @@ foreach my $s (@sample){ if($length==1){ my $b=$fq_base[0]; $preprocess_fastq=$o_dir.$s.".fastq"; - $fastq1=$s_dir.$b."1.fastq.gz"; - $fastq2=$s_dir.$b."2.fastq.gz"; + $fastq1=$s_dir.$b."1".$suffix; + $fastq2=$s_dir.$b."2".$suffix; + die("Error: missed file: $fastq1\n") unless -e $fastq1; + die("Error: missed file: $fastq2\n") unless -e $fastq2; $com="$sga preprocess -o $preprocess_fastq --pe-mode 1 $fastq1 $fastq2\n"; $com.="$sga index -a ropebwt --no-reverse -t $thread_num $preprocess_fastq\n"; } @@ -537,10 +540,10 @@ foreach my $s (@sample){ my @list; my $merge_prefix=$o_dir.$s; foreach my $b (@fq_base){ - $fastq1=$s_dir.$b."1.fastq.gz"; - $fastq2=$s_dir.$b."2.fastq.gz"; - print STDERR "Warning: missed file: $fastq1\n" unless -e $fastq1; - print STDERR "Warning: missed file: $fastq2\n" unless -e $fastq2; + $fastq1=$s_dir.$b."1".$suffix; + $fastq2=$s_dir.$b."2".$suffix; + die("Error: missed file: $fastq1\n") unless -e $fastq1; + die("Error: missed file: $fastq2\n") unless -e $fastq2; $preprocess_fastq=$s_dir.$s.".fastq"; $com="$sga preprocess -o $preprocess_fastq --pe-mode 1 $fastq1 $fastq2\n"; $com.="$sga index -a ropebwt --no-reverse -t 
$thread_num $preprocess_fastq\n"; diff --git a/lib/HUPANassemSLURM.pm b/lib/HUPANassemSLURM.pm index 78e2459..72805b1 100644 --- a/lib/HUPANassemSLURM.pm +++ b/lib/HUPANassemSLURM.pm @@ -2,6 +2,9 @@ #Created by Hu Zhiqiang, 2014-9-5 #Modefied by Duan Zhongqu, 2018-7-2 #Added the function of de novo assembly by sga with lower memory. +#Modefied by Duan Zhongqu, 2020-4-25 +#Fix some bugs, for example, the suffix of sequencing data files. + package assembly; sub assemble{ @@ -18,7 +21,7 @@ Commands: if($com eq "soapdenovo"){ soap(@ARGV); } - elsif($com eq "linearK.pl"){ + elsif($com eq "linearK"){ linearK(@ARGV); } elsif($com eq "sga"){ @@ -44,29 +47,29 @@ hupanSLURM assemble soapdenovo is used to assemble high-quality reads on large s Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by .fastq or .fastq.gz. + sequencing files ended by \".fq.gz\" or \".fastq.gz\". output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - sopadenovo_directory directory where soapdenovo2 executable files exists + sopadenovo_directory Directory where soapdenovo2 executable files exists Options: - -h Print this usage page. + -h Print this usage page. - -t Threads used. - Default: 1 + -t Threads used. + Default: 1 -s Suffix of files within data_directory. - Default: .fq.gz + Default: \".fastq.gz\" - -k Kmer. - Default: 35 + -k Kmer. + Default: 35 -c Parameters of soapdenovo2 config file. 
8 parameters ligated by comma 1)maximal read length @@ -81,7 +84,7 @@ Options: (at least 32 for short insert size) Default: 80,460,0,3,80,1,3,32 - -g enable gapcloser + -g Enable gapcloser -q The queue name for job submiting. Default: default queue @@ -93,10 +96,7 @@ my ($data_dir,$out_dir,$tool_dir)=@ARGV; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } $tool_dir.="/" unless $tool_dir=~/\/$/; @@ -118,7 +118,7 @@ my $kmer=35; $kmer=$opt_k if defined $opt_k; #define file suffix -my $suffix=".fq.gz"; +my $suffix=".fastq.gz"; $suffix=$opt_s if defined($opt_s); my ($max_rd_len,$avg_ins,$reverse_seq,$asm_flags,$rd_len_cutoff,$rank,$pair_num_cutoff,$map_len) @@ -241,29 +241,29 @@ hupanSLURM assmble linearK is used to assemble high-quality reads on large scale Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several sequencing files ended by .fastq or .fastq.gz. output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - sopadenovo_directory directory where soapdenovo2 executable files exists + sopadenovo_directory Directory where soapdenovo2 executable files exists Options: - -h Print this usage page. + -h Print this usage page. - -t Threads used. - Default: 1 + -t Threads used. 
+ Default: 1 -g Genome size. Used to infer sequencing depth. - Default: 380000000 (460M) + Default: 3000000000 (3G) -s Suffix of files within data_directory. - Default: .fq.gz + Default: \".fastq.gz\" -r Parameters of linear function: Kmer=2*int(0.5*(a*Depth+b))+1. The parameter should be input as \"a,b\". @@ -299,6 +299,7 @@ Options: -m The number of consecutive Ns to be broken down to contigs.This is used in the process break gapclosed scaffolds to contigs. Default: 10. + -q The queue name for job submiting. Default: default queue "; @@ -309,10 +310,7 @@ my ($data_dir,$out_dir,$soapdenovo)=@ARGV; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist -."); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Check executable linearK.pl my $exec="linearK.pl"; @@ -405,8 +403,8 @@ use strict; use warnings; use Cwd 'abs_path'; use Getopt::Std; -use vars qw($opt_h $opt_t $opt_d $opt_q $opt_m); -getopts("ht:d:q:m:"); +use vars qw($opt_h $opt_t $opt_d $opt_q $opt_m $opt_s); +getopts("ht:d:q:m:s:"); my $usage="\nUsage: hupan assemble sga [options] @@ -415,12 +413,12 @@ hupan assemble sga is used to assemble high-quality reads on large scale. Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by fastq.gz. + sequencing files ended by \".fastq.gz\" or \".fq.gz\". output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. 
We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -433,8 +431,11 @@ Options: -t Threads used. Default: 1. + -s Suffix of files within data_directory. + Default: \".fastq.gz\" + -d The parameter sets used for different depths of sequencing - data. We obtained two optimatial paramter sets from + data, we obtained two optimatial paramter sets from simulated data for 30-fold and 60-fold, respectively. Default: 30. @@ -452,16 +453,13 @@ my ($data_dir,$out_dir,$sga_dir)=@ARGV; $data_dir=abs_path($data_dir); $out_dir=abs_path($out_dir); -print $data_dir."\n"; -print $out_dir."\n"; +#print $data_dir."\n"; +#print $out_dir."\n"; my $cmd_dir=$ENV{'PWD'}; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist -."); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Check executable sga @@ -479,6 +477,10 @@ if(defined($opt_t)){ my $depth=30; $depth=$opt_d if defined $opt_d; +#define file suffix +my $suffix=".fastq.gz"; +$suffix=$opt_s if defined($opt_s); + #whether keep the intermediate results my $keep=0; $keep=$opt_m if defined $opt_m; @@ -510,9 +512,9 @@ foreach my $s (@sample){ foreach my $f (@files){ next if $f=~/^\.+$/; next if $f=~/^single/; - print STDERR "Warnig: $f without suffix: fastq.gz\n" unless $f=~/fastq.gz$/; - next unless $f=~/fastq.gz$/; - my $fb=substr($f,0,length($f)-length(".fastq.gz")-1); + print STDERR "Warnig: $f without suffix: $suffix\n" unless $f=~/$suffix$/; + next unless $f=~/$suffix$/; + my $fb=substr($f,0,length($f)-length($suffix)-1); #print $fb."\n"; push @fq_base, $fb; } @@ -528,8 +530,10 @@ foreach my $s (@sample){ if($length==1){ my $b=$fq_base[0]; $preprocess_fastq=$o_dir.$s.".fastq"; - $fastq1=$s_dir.$b."1.fastq.gz"; - $fastq2=$s_dir.$b."2.fastq.gz"; + $fastq1=$s_dir.$b."1".$suffix; + $fastq2=$s_dir.$b."2".$suffix; + die("Error: missed file: $fastq1\n") unless -e $fastq1; + die("Error: missed file: $fastq2\n") unless -e $fastq2; $com="$sga preprocess -o $preprocess_fastq --pe-mode 1 $fastq1 $fastq2\n"; $com.="$sga index -a ropebwt --no-reverse -t $thread_num $preprocess_fastq\n"; } @@ -537,10 +541,10 @@ foreach my $s (@sample){ my @list; my $merge_prefix=$o_dir.$s; foreach my $b (@fq_base){ - $fastq1=$s_dir.$b."1.fastq.gz"; - $fastq2=$s_dir.$b."2.fastq.gz"; - print STDERR "Warning: missed file: $fastq1\n" unless -e $fastq1; - print STDERR "Warning: missed file: $fastq2\n" unless -e $fastq2; + $fastq1=$s_dir.$b."1".$suffix; + $fastq2=$s_dir.$b."2".$suffix; + die("Error: missed file: $fastq1\n") unless -e $fastq1; + die("Error: missed file: $fastq2\n") unless -e $fastq2; $preprocess_fastq=$s_dir.$s.".fastq"; $com="$sga preprocess -o $preprocess_fastq --pe-mode 1 $fastq1 $fastq2\n"; 
$com.="$sga index -a ropebwt --no-reverse -t $thread_num $preprocess_fastq\n"; diff --git a/lib/HUPANassemSta.pm b/lib/HUPANassemSta.pm index 3cc0f01..e12d614 100644 --- a/lib/HUPANassemSta.pm +++ b/lib/HUPANassemSta.pm @@ -18,12 +18,12 @@ The script will call QUAST program, so the directory where quast.py locates is n Necessary input description: assembly_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, assembly results, including - files *.scafSeq and *.contig, should exist. + files \"*.scafSeq\" and \"*.contig\", should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -43,7 +43,7 @@ Options: -g Check the statistics of gap-closed assemblies if -g is enabled. In the assembly directory of each sample, - *_gc.scafSeq and *_gc.contig should exist. + \"*_gc.scafSeq\" and \"*_gc.contig\" should exist. Default: check statistics of raw assemblies -s Check the statistics of assembled scaffolds if -s is enabled. @@ -67,10 +67,7 @@ die("Error02: Cannot find reference sequence file\n") unless(-e $ref); #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads diff --git a/lib/HUPANassemStaLSF.pm b/lib/HUPANassemStaLSF.pm index 4f2306f..4a3219f 100644 --- a/lib/HUPANassemStaLSF.pm +++ b/lib/HUPANassemStaLSF.pm @@ -18,12 +18,12 @@ The script will call QUAST program, so the directory where quast.py locates is n Necessary input description: assembly_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, assembly results, including - files *_gc.scafSeq and *_gc.contig, should exist. + files \"*_gc.scafSeq\" and \"*.contig\", should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -43,7 +43,7 @@ Options: -g Check the statistics of gap-closed assemblies if -g is enabled. In the assembly directory of each sample, - *_gc.scafSeq and *_gc.contig should exist. + \"*_gc.scafSeq\" and \"*_gc.contig\" should exist. Default: check statistics of raw assemblies -s Check the statistics of assembled scaffolds if -s is enabled. @@ -67,10 +67,7 @@ die("Error02: Cannot find reference sequence file\n") unless(-e $ref); #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads @@ -175,11 +172,11 @@ hupan LSF mergeAssemSta is used to collect statistices info of assembly from qua Necessary input description: - QUAST_output_directory_list One or more of quast . + QUAST_output_directory_list One or more of quast output results. - unaligned_contig_list File including a list of names of unaligned contigs. - In each directory, there should be sub directories - named by the sample names. + unaligned_contig_list File including a list of names of unaligned contigs. + In each directory, there should be sub directories + named by the sample names. "; die $usage if @ARGV<1; diff --git a/lib/HUPANassemStaSLURM.pm b/lib/HUPANassemStaSLURM.pm index b1b40a5..e3ce414 100644 --- a/lib/HUPANassemStaSLURM.pm +++ b/lib/HUPANassemStaSLURM.pm @@ -18,12 +18,12 @@ The script will call QUAST program, so the directory where quast.py locates is n Necessary input description: assembly_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, assembly results, including - files *_gc.scafSeq and *_gc.contig, should exist. + files \"*_gc.scafSeq\" and \"*.contig\", should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -43,7 +43,7 @@ Options: -g Check the statistics of gap-closed assemblies if -g is enabled. In the assembly directory of each sample, - *_gc.scafSeq and *_gc.contig should exist. + \"*_gc.scafSeq\" and \"*_gc.contig\" should exist. Default: check statistics of raw assemblies -s Check the statistics of assembled scaffolds if -s is enabled. 
@@ -67,10 +67,7 @@ die("Error02: Cannot find reference sequence file\n") unless(-e $ref); #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads @@ -176,11 +173,11 @@ hupan SLURM mergeAssemSta is used to collect statistices info of assembly from q Necessary input description: - QUAST_output_directory_list One or more of quast . + QUAST_output_directory_list One or more of quast output results. - unaligned_contig_list File including a list of names of unaligned contigs. - In each directory, there should be sub directories - named by the sample names. + unaligned_contig_list File including a list of names of unaligned contigs. + In each directory, there should be sub directories + named by the sample names. "; die $usage if @ARGV<1; diff --git a/lib/HUPANbam2bed.pm b/lib/HUPANbam2bed.pm index 251c4e1..344be3b 100644 --- a/lib/HUPANbam2bed.pm +++ b/lib/HUPANbam2bed.pm @@ -5,20 +5,20 @@ package bam2cov; sub bam2bed{ use strict; use warnings; -my $usage="\nUsage: hupan bam2bed [options] +my $usage="\nUsage: hupan bam2bed This tool is used to calculate the covered region of the genome. -The outputs are covered fragments without overlap in 3-column .bed format. +The outputs are covered fragments without overlap in 3-column \".bed\" format. Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory. 
To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -46,10 +46,7 @@ die("Executable bam2cov cannot be found in your PATH!\n #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads diff --git a/lib/HUPANbam2bedLSF.pm b/lib/HUPANbam2bedLSF.pm index baa6cdc..b4cc35a 100644 --- a/lib/HUPANbam2bedLSF.pm +++ b/lib/HUPANbam2bedLSF.pm @@ -2,35 +2,39 @@ use strict; use warnings; package bam2cov; use Getopt::Std; -use vars qw($opt_q); -getopts("q:"); +use vars qw($opt_q $opt_h); +getopts("q:h"); sub bam2bed{ use strict; use warnings; -my $usage="\nUsage: hupanLSF bam2bed [options] +my $usage="\nUsage: hupanLSF bam2bed [options] This tool is used to calculate the covered region of the genome. -The outputs are covered fragments without overlap in 3-column .bed format. +The outputs are covered fragments without overlap in 3-column \".bed\" format. Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. 
- -q The queue name for job submiting. - default: default queue +Options: + -h Print this usage page. + + -q The queue name for job submiting. + Default: default queue "; die $usage if @ARGV<2; +die $usage if defined($opt_h); my ($data_dir,$out_dir)=@ARGV; #detect bam2cov @@ -51,10 +55,7 @@ die("Executable bam2cov cannot be found in your PATH!\n #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads @@ -112,7 +113,7 @@ foreach my $s (@sample){ #create job script open(JOB,">$job_file")||die("Error05: Unable to create job file: $job_file\n"); print JOB "\#BSUB -J $s","_bam2bed\n"; #job name - print JOB "\#BSUB -q $opt_q\n" if defined $opt_q; #queue name in the submission system + print JOB "\#BSUB -q $opt_q\n" if defined $opt_q; #queue name in the submission system print JOB "\#BSUB -o $out_file\n"; #stdout print JOB "\#BSUB -e $err_file\n"; #stderr print JOB "\#BSUB -n $thread_num\n"; #thread number @@ -120,8 +121,6 @@ foreach my $s (@sample){ close JOB; system("bsub <$job_file"); #submit job #***************************************************************************************** - - } } 1; diff --git a/lib/HUPANbam2bedSLURM.pm b/lib/HUPANbam2bedSLURM.pm index 55ce9df..813b8d8 100644 --- a/lib/HUPANbam2bedSLURM.pm +++ b/lib/HUPANbam2bedSLURM.pm @@ -2,35 +2,39 @@ use strict; use warnings; package bam2cov; use Getopt::Std; -use vars qw($opt_q); -getopts("q:"); +use vars qw($opt_q $opt_h); +getopts("q:h"); sub bam2bed{ use strict; use warnings; -my $usage="\nUsage: hupanSLURM bam2bed [options] +my $usage="\nUsage: hupanSLURM bam2bed [options] This tool is used to calculate the covered region of the genome. 
-The outputs are covered fragments without overlap in 3-column .bed format. +The outputs are covered fragments without overlap in 3-column \".bed\" format. Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - -q The queue name for job submiting. - default: default queue +Options: + -h Print this usage page. + + -q The queue name for job submiting. + Default: default queue "; die $usage if @ARGV<2; +die $usage if defined($opt_h); my ($data_dir,$out_dir)=@ARGV; #detect bam2cov @@ -45,16 +49,12 @@ foreach my $p (@path){ last; } } -die("Executable bam2cov cannot be found in your PATH!\n -") unless($fpflag); +die("Executable bam2cov cannot be found in your PATH!\n") unless($fpflag); #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads @@ -91,16 +91,16 @@ foreach my $s (@sample){ my $sd=$data_dir.$s."/"; next unless(-d $sd); print STDERR "Process sample $sd\n"; -#obtain *.bam file within the sample directory + #obtain *.bam file within the sample directory my $bam_file=$sd.$s.".bam"; unless(-e $bam_file){ - print STDERR "Warnings: cannot find bam file($bam_file) in $sd: skip this sample\n"; - next; + print STDERR "Warnings: cannot find bam file($bam_file) in $sd: skip this sample\n"; + next; } -#create output directory for a sample + #create output directory for a sample my $sample_out.=$out_data."$s.bed"; -#generate command + #generate command my $com; $com="$exec $bam_file >$sample_out\n"; @@ -121,8 +121,6 @@ foreach my $s (@sample){ close JOB; system("sbatch $job_file"); #submit job #***************************************************************************************** - - } } 1; diff --git a/lib/HUPANbamSta.pm b/lib/HUPANbamSta.pm index 7ad25e2..68a6f93 100644 --- a/lib/HUPANbamSta.pm +++ b/lib/HUPANbamSta.pm @@ -41,12 +41,12 @@ The script will call bam_stats (in BamUtil), so the directory where bamUtil loca Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample1,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. 
@@ -68,10 +68,7 @@ die("Error01: Cannot find bam_stats file in directory bin/ under $bamutil_dir\n" #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads @@ -135,12 +132,12 @@ The script will call qualimap, so the directory where qualimap locates is needed Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -174,10 +171,7 @@ die("Error01: Cannot find qualimap file in directory bin/ under $qualimap_dir\n" #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads diff --git a/lib/HUPANbamStaLSF.pm b/lib/HUPANbamStaLSF.pm index c013cf3..79d41aa 100644 --- a/lib/HUPANbamStaLSF.pm +++ b/lib/HUPANbamStaLSF.pm @@ -47,12 +47,12 @@ The script will call bam_stats (in BamUtil), so the directory where bamUtil loca Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -61,8 +61,8 @@ Necessary input description: Options: - -q The queue name for job submiting. - default: default queue + -q The queue name for job submiting. + Default: default queue "; @@ -79,10 +79,7 @@ die("Error01: Cannot find bam_stats file in directory bin/ under $bamutil_dir #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads @@ -169,12 +166,12 @@ The script will call qualimap, so the directory where qualimap locates is needed Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. 
- In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -209,10 +206,7 @@ die("Error01: Cannot find qualimap file in directory bin/ under $qualimap_dir #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads @@ -300,7 +294,7 @@ hupanLSF bamSta mergeBasicSta is used to collect and merge bam basic statistics. Necessary input description: - directory directories of bamUtil results. + directory Directories of bamUtil results. Each directory contains sub-directories named by sample names. data/ directory of bamSta basic output. @@ -377,7 +371,7 @@ hupanLSF bamSta mergeCovSta is used to collect and merge bam coverage statistics Necessary input description: - directory directories of qualimap results. + directory Directories of qualimap results. Each directory contains sub-directories named by sample names. diff --git a/lib/HUPANbamStaSLURM.pm b/lib/HUPANbamStaSLURM.pm index 001841f..1640c69 100644 --- a/lib/HUPANbamStaSLURM.pm +++ b/lib/HUPANbamStaSLURM.pm @@ -47,12 +47,12 @@ The script will call bam_stats (in BamUtil), so the directory where bamUtil loca Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. 
- In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -61,8 +61,8 @@ Necessary input description: Options: - -q The queue name for job submiting. - default: default queue + -q The queue name for job submiting. + Default: default queue "; @@ -79,10 +79,7 @@ die("Error01: Cannot find bam_stats file in directory bin/ under $bamutil_dir #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads @@ -170,12 +167,12 @@ The script will call qualimap, so the directory where qualimap locates is needed Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. 
@@ -210,10 +207,7 @@ die("Error01: Cannot find qualimap file in directory bin/ under $qualimap_dir #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads @@ -302,7 +296,7 @@ hupanSLURM bamSta mergeBasicSta is used to collect and merge bam basic statistic Necessary input description: - directory directories of bamUtil results. + directory Directories of bamUtil results. Each directory contains sub-directories named by sample names. data/ directory of bamSta basic output. @@ -379,7 +373,7 @@ hupanSLURM bamSta mergeCovSta is used to collect and merge bam coverage statisti Necessary input description: - directory directories of qualimap results. + directory Directories of qualimap results. Each directory contains sub-directories named by sample names. diff --git a/lib/HUPANblastAlign.pm b/lib/HUPANblastAlign.pm index b92f5cb..e634ca0 100644 --- a/lib/HUPANblastAlign.pm +++ b/lib/HUPANblastAlign.pm @@ -66,7 +66,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the\n output directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the\n output directory should not exist.\n"); } $out_dir.="/" unless $out_dir=~/\/$/; mkdir $out_dir; @@ -109,9 +109,9 @@ hupan blastAlign blast is used to align the sequences from multiple individual i Necessary input description: data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152, etc. 
+ named by sample names, such as Sample1, Sample2, etc. In each sub-directory, there should be several - sequencing files ended by .fa or .fa.gz. + files ended by \".fa\" or \".fa.gz\". output_directory Output directory. @@ -136,7 +136,7 @@ Options: Default: 10-5 -s Suffix of files within data_directory. - Default: .fa + Default: \".fa\" "; @@ -146,7 +146,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the\n output directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the\n output directory should not exist.\n"); } #get the sequence type diff --git a/lib/HUPANblastAlignLSF.pm b/lib/HUPANblastAlignLSF.pm index 6264172..8421fce 100644 --- a/lib/HUPANblastAlignLSF.pm +++ b/lib/HUPANblastAlignLSF.pm @@ -66,7 +66,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the\n output directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the\n output directory should not exist.\n"); } $out_dir.="/" unless $out_dir=~/\/$/; mkdir $out_dir; @@ -132,9 +132,9 @@ hupanLSF blastAlign blast is used to align the sequences from multiple individua Necessary input description: data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152, etc. - In each sub-directory, there should be several - sequencing files ended by .fa or .fa.gz. + named by sample names, such as Sample1, Sample2, etc. + In each sub-directory, there should be several files + ended by \".fa\" or \".fa.gz\". output_directory Output directory. 
@@ -159,7 +159,7 @@ Options: Default: 10-5 -s Suffix of files within data_directory. - Default: .fa + Default: \".fa\" "; @@ -169,7 +169,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the\n output directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the\n output directory should not exist.\n"); } #get the sequence type diff --git a/lib/HUPANblastAlignSLURM.pm b/lib/HUPANblastAlignSLURM.pm index 38e9635..f78c0ba 100644 --- a/lib/HUPANblastAlignSLURM.pm +++ b/lib/HUPANblastAlignSLURM.pm @@ -66,7 +66,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the\n output directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the\n output directory should not exist.\n"); } $out_dir.="/" unless $out_dir=~/\/$/; mkdir $out_dir; @@ -133,9 +133,9 @@ hupanSLURM blastAlign blast is used to align the sequences from multiple individ Necessary input description: data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152, etc. + named by sample names, such as Sample1, Sample2, etc. In each sub-directory, there should be several - sequencing files ended by .fa or .fa.gz. + files ended by \".fa\" or \".fa.gz\". output_directory Output directory. @@ -170,7 +170,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. 
We kindly request that the\n output directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the\n output directory should not exist.\n"); } #get the sequence type diff --git a/lib/HUPANfilterNovGene.pm b/lib/HUPANfilterNovGene.pm index 5e95d7b..b8094b0 100644 --- a/lib/HUPANfilterNovGene.pm +++ b/lib/HUPANfilterNovGene.pm @@ -26,7 +26,7 @@ Necessary descriptions: blast_dir The directory of blastn and makeblastdb locates. - cd-hit_dir The directory of cd-hit locates. + cd-hit_dir The directory of cd-hit-est locates. RepeatMask_dir The directory of RepeatMask locates. @@ -50,7 +50,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANfilterNovGeneLSF.pm b/lib/HUPANfilterNovGeneLSF.pm index 54e4d6d..34ddd1d 100644 --- a/lib/HUPANfilterNovGeneLSF.pm +++ b/lib/HUPANfilterNovGeneLSF.pm @@ -26,7 +26,7 @@ Necessary descriptions: blast_dir The directory of blastn and makeblastdb locates. - cd-hit_dir The directory of cd-hit locates. + cd-hit_dir The directory of cd-hit-est locates. RepeatMask_dir The directory of RepeatMask locates. @@ -50,7 +50,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. 
We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANfilterNovGeneSLURM.pm b/lib/HUPANfilterNovGeneSLURM.pm index 5cdc9c4..274d7ae 100644 --- a/lib/HUPANfilterNovGeneSLURM.pm +++ b/lib/HUPANfilterNovGeneSLURM.pm @@ -26,7 +26,7 @@ Necessary descriptions: blast_dir The directory of blastn and makeblastdb locates. - cd-hit_dir The directory of cd-hit locates. + cd-hit_dir The directory of cd-hit-est locates. RepeatMask_dir The directory of RepeatMask locates. @@ -50,7 +50,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANgeneCov.pm b/lib/HUPANgeneCov.pm index 60722ed..7ecb1cc 100644 --- a/lib/HUPANgeneCov.pm +++ b/lib/HUPANgeneCov.pm @@ -16,17 +16,17 @@ The script will call samtools and ccov. Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. 
It is to say, this directory will be created by the script itself. - gene_annotation gene annotations in a single gtf file + gene_annotation Gene annotations in a single gtf file Options: -h Print this usage page. @@ -56,10 +56,7 @@ die("ccov cannot be found in your PATH!\n #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads diff --git a/lib/HUPANgeneCovLSF.pm b/lib/HUPANgeneCovLSF.pm index 5521682..d4b1eb0 100644 --- a/lib/HUPANgeneCovLSF.pm +++ b/lib/HUPANgeneCovLSF.pm @@ -17,17 +17,17 @@ The script will call samtools and ccov. Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - gene_annotation gene annotations in a single gtf file + gene_annotation Gene annotations in a single gtf file Options: -h Print this usage page. @@ -59,10 +59,7 @@ die("ccov cannot be found in your PATH!\n #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads diff --git a/lib/HUPANgeneCovSLURM.pm b/lib/HUPANgeneCovSLURM.pm index 8e702f3..8ed00dc 100644 --- a/lib/HUPANgeneCovSLURM.pm +++ b/lib/HUPANgeneCovSLURM.pm @@ -17,17 +17,17 @@ The script will call samtools and ccov. Necessary input description: bam_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, mapping result, a sorted .bam + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, mapping result, a sorted \".bam\" file, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - gene_annotation gene annotations in a single gtf file + gene_annotation Gene annotations in a single gtf file Options: -h Print this usage page. @@ -59,10 +59,7 @@ die("ccov cannot be found in your PATH!\n #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads diff --git a/lib/HUPANgeneExist.pm b/lib/HUPANgeneExist.pm index 714257c..ff2f96a 100644 --- a/lib/HUPANgeneExist.pm +++ b/lib/HUPANgeneExist.pm @@ -72,6 +72,7 @@ my @n; open(IN,$matrix); while(<IN>){ + chomp; $i++; my @t=split /\t/,$_; if($i==1){ @@ -151,7 +152,7 @@ while(){ close EX; print STDERR "Gene family number:",scalar(@group),"\n"; print STDERR "Gene number:",scalar(keys(%geneExt)),"\n"; -print STDERR "Rice line number:",scalar(@riceline),"\n"; +print STDERR "Genome number:",scalar(@riceline),"\n"; ############ end ############ ###### gene fam exist ####### diff --git a/lib/HUPANgeneExistLSF.pm b/lib/HUPANgeneExistLSF.pm index b2ab3f6..c86a73a 100644 --- a/lib/HUPANgeneExistLSF.pm +++ b/lib/HUPANgeneExistLSF.pm @@ -72,6 +72,7 @@ my @n; open(IN,$matrix); while(<IN>){ + chomp; $i++; my @t=split /\t/,$_; if($i==1){ @@ -152,7 +153,7 @@ while(){ close EX; print STDERR "Gene family number:",scalar(@group),"\n"; print STDERR "Gene number:",scalar(keys(%geneExt)),"\n"; -print STDERR "Rice line number:",scalar(@riceline),"\n"; +print STDERR "Genome number:",scalar(@riceline),"\n"; ############ end ############ ###### gene fam exist ####### diff --git a/lib/HUPANgeneExistSLURM.pm b/lib/HUPANgeneExistSLURM.pm index ec1cd75..81324e1 100644 --- a/lib/HUPANgeneExistSLURM.pm +++ b/lib/HUPANgeneExistSLURM.pm @@ -152,7 +152,7 @@ while(){ close EX; print STDERR "Gene family number:",scalar(@group),"\n"; print STDERR "Gene number:",scalar(keys(%geneExt)),"\n"; -print STDERR "Rice line number:",scalar(@riceline),"\n"; +print STDERR "Genome number:",scalar(@riceline),"\n"; ############ end ############ ###### gene fam exist ####### diff --git a/lib/HUPANgenePre.pm b/lib/HUPANgenePre.pm index beaa486..ece3e9f 100644 --- a/lib/HUPANgenePre.pm +++ b/lib/HUPANgenePre.pm @@ -39,7 +39,7 @@ Options: #check existense of output directory # if(-e $out_dir){ 
-# die("Error: 
output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); +# die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); # } #adjust directory names and create output directory diff --git a/lib/HUPANgenePreLSF.pm b/lib/HUPANgenePreLSF.pm index 40e1697..c37d034 100644 --- a/lib/HUPANgenePreLSF.pm +++ b/lib/HUPANgenePreLSF.pm @@ -39,7 +39,7 @@ Options: #check existense of output directory # if(-e $out_dir){ -# die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); +# die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); # } #adjust directory names and create output directory diff --git a/lib/HUPANgenePreSLURM.pm b/lib/HUPANgenePreSLURM.pm index 2e7b336..cefe4cc 100644 --- a/lib/HUPANgenePreSLURM.pm +++ b/lib/HUPANgenePreSLURM.pm @@ -39,7 +39,7 @@ Options: #check existense of output directory # if(-e $out_dir){ -# die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); +# die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); # } #adjust directory names and create output directory diff --git a/lib/HUPANgetTaxClass.pm b/lib/HUPANgetTaxClass.pm index ea94d19..40c2164 100644 --- a/lib/HUPANgetTaxClass.pm +++ b/lib/HUPANgetTaxClass.pm @@ -37,7 +37,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. 
We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANgetTaxClassLSF.pm b/lib/HUPANgetTaxClassLSF.pm index 2574838..1aa95cc 100644 --- a/lib/HUPANgetTaxClassLSF.pm +++ b/lib/HUPANgetTaxClassLSF.pm @@ -37,7 +37,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANgetTaxClassSLURM.pm b/lib/HUPANgetTaxClassSLURM.pm index 6568400..8a1e5ef 100644 --- a/lib/HUPANgetTaxClassSLURM.pm +++ b/lib/HUPANgetTaxClassSLURM.pm @@ -37,7 +37,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANmap.pm b/lib/HUPANmap.pm index 066b0ed..9d7872c 100644 --- a/lib/HUPANmap.pm +++ b/lib/HUPANmap.pm @@ -17,12 +17,12 @@ The script will call mapping program (bwa mem or bowtie2), so the directory wher Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. 
+ named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several sequencing files ended by .fq(.gz) or .fastq(.gz). output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -67,7 +67,7 @@ my ($data_dir,$out_dir,$map_dir,$map_index)=@ARGV; #Check existence of output directory if(-e $out_dir){ die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the +To avoid overwriting of existing files, we kindly request that the output directory should not exist. "); } diff --git a/lib/HUPANmapLSF.pm b/lib/HUPANmapLSF.pm index d3c2a2e..1a8a611 100644 --- a/lib/HUPANmapLSF.pm +++ b/lib/HUPANmapLSF.pm @@ -17,12 +17,12 @@ The script will call mapping program (bwa mem or bowtie2), so the directory wher Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several sequencing files ended by .fq(.gz) or .fastq(.gz). output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -71,7 +71,7 @@ my ($data_dir,$out_dir,$map_dir,$map_index)=@ARGV; #Check existence of output directory if(-e $out_dir){ die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. 
We kindly request that the +To avoid overwriting of existing files, we kindly request that the output directory should not exist."); } diff --git a/lib/HUPANmapSLURM.pm b/lib/HUPANmapSLURM.pm index 69f9b3b..dd8521b 100644 --- a/lib/HUPANmapSLURM.pm +++ b/lib/HUPANmapSLURM.pm @@ -17,12 +17,12 @@ The script will call mapping program (bwa mem or bowtie2), so the directory wher Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several sequencing files ended by .fq(.gz) or .fastq(.gz). output_directory Alignment results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -71,7 +71,7 @@ my ($data_dir,$out_dir,$map_dir,$map_index)=@ARGV; #Check existence of output directory if(-e $out_dir){ die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the +To avoid overwriting of existing files, we kindly request that the output directory should not exist."); } diff --git a/lib/HUPANmergeNovGene.pm b/lib/HUPANmergeNovGene.pm index ec20338..97b0b68 100644 --- a/lib/HUPANmergeNovGene.pm +++ b/lib/HUPANmergeNovGene.pm @@ -40,7 +40,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. 
We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANmergeNovGeneLSF.pm b/lib/HUPANmergeNovGeneLSF.pm index 5b42318..f39208d 100644 --- a/lib/HUPANmergeNovGeneLSF.pm +++ b/lib/HUPANmergeNovGeneLSF.pm @@ -40,7 +40,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANmergeNovGeneSLURM.pm b/lib/HUPANmergeNovGeneSLURM.pm index 61811f4..4a1af89 100644 --- a/lib/HUPANmergeNovGeneSLURM.pm +++ b/lib/HUPANmergeNovGeneSLURM.pm @@ -40,7 +40,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. 
We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANpTpG.pm b/lib/HUPANpTpG.pm index d6ea3de..1ca1eb3 100644 --- a/lib/HUPANpTpG.pm +++ b/lib/HUPANpTpG.pm @@ -1,26 +1,86 @@ #!/usr/bin/perl package pTpG; + +sub pTpG{ use strict; use warnings; use Getopt::Std; -use vars qw($opt_e); -getopts("e"); +use vars qw($opt_h $opt_e $opt_f); +getopts("hef"); +my $usage="\nUsage: hupan pTpG [options] inputfile outputfile -sub pTpG{ -my $usage = "hupan perTranPerGene +hupan pTpG is used to obtain the longest transcript of each protein-coding genes from annotation file. +Notice: a file named \"protein_coding.gtf\" will be automatically prodcued to store all the protein-coding genes. + +Necessary input description: + + inputfile Annotation file. Default: \"gff\" format. + If the annotation file is \"gtf\", please use the option \"-f\". + + outputfile Outputfile. + +Options: -This tool is to obtain the longest trancript of each gene. + -h Print this usage page. -Option: - -e Check \"exon\" length instead of check \"CDS\" length - Note \"exon\" or \"CDS\" should exist in the 3rd column of the input file + -e Check \"exon\" length instead of check \"CDS\" length. + -f The annotation file is \"gtf\" format. 
"; + +#print @ARGV; die $usage if @ARGV!=2; -my ($file,$out) = @ARGV; -my ($Vec, $num) = @{Read_GTF($file)}; +die $usage if defined($opt_h); +my ($gff_file,$out_file)=@ARGV; + +my $format="gff"; +$format="gtf" if defined($opt_f); + +open(IN,$gff_file)||die("Error: cannot read the file: $gff_file.\n"); +my $temp_file="protein_coding.gtf"; +open(OUT,">$temp_file")||die("Error: cannot write the file: $temp_file.\n"); +my $pre_gid=""; +my $pre_tid=""; +my %genes; + +while(my $line=){ + chomp $line; + next if $line=~/^#/; + my ($chr,$source,$type,$start,$end,$score,$sym,$phase,$record)=split /\t/,$line; + if($format eq "gff"){ + my @string=split /[;|=]/,$record; + if($type eq "gene"){ + if($string[5] eq "protein_coding"){ + $pre_gid=$string[3]; + } + }else{ + if($string[5] eq $pre_gid){ + $pre_tid=$string[7]; + print OUT "$chr\t$source\t$type\t$start\t$end\t$score\t$sym\t$phase\tgene_id \"$pre_gid\"; transcript_id \"$pre_tid\";\n"; + } + } + }else{ + if($type eq "gene"){ + $record=~/(gene_id \"[^\"]+\"); (gene_type \"[^\"]+\")/; + if($2 eq "gene_type \"protein_coding\""){ + $pre_gid=$1; + } + }else{ + $record=~/(gene_id \"[^\"]+\"); (transcript_id \"[^\"]+\")/; + if($1 eq $pre_gid ){ + $pre_tid=$2; + print OUT "$chr\t$source\t$type\t$start\t$end\t$score\t$sym\t$phase\t$pre_gid; $pre_tid;\n"; + } + } + } +} +close IN; +close OUT; -open(OUT,">$out"); +my ($Vec, $num) = @{Read_GTF($temp_file)}; + +#open(OUT,">$out"); +open(OUT,">$out_file")||die("Error: cannot write the file: $out_file.\n"); foreach my $d (@{$Vec}){ my $target=$d->[0]; my $len=getLen($d->[0]); @@ -32,7 +92,7 @@ foreach my $d (@{$Vec}){ } } if($len!=0){ - OutputTran($target); + OutputTran($target); } } @@ -90,7 +150,6 @@ sub Read_GTF{ while(){ chomp; if(/^[0-9a-zA-Z]+/){ - if($begin == 1){ @temp = split("\t",$_); $temp[8]=~/(gene_id \"[^\"]+\"); (transcript_id \"[^\"]+\")/; diff --git a/lib/HUPANpTpGLSF.pm b/lib/HUPANpTpGLSF.pm index 74a6067..1ca1eb3 100644 --- a/lib/HUPANpTpGLSF.pm +++ 
b/lib/HUPANpTpGLSF.pm @@ -1,26 +1,86 @@ #!/usr/bin/perl package pTpG; + +sub pTpG{ use strict; use warnings; use Getopt::Std; -use vars qw($opt_e); -getopts("e"); +use vars qw($opt_h $opt_e $opt_f); +getopts("hef"); +my $usage="\nUsage: hupan pTpG [options] inputfile outputfile -sub pTpG{ -my $usage = "hupanLSF perTranPerGene +hupan pTpG is used to obtain the longest transcript of each protein-coding genes from annotation file. +Notice: a file named \"protein_coding.gtf\" will be automatically prodcued to store all the protein-coding genes. + +Necessary input description: + + inputfile Annotation file. Default: \"gff\" format. + If the annotation file is \"gtf\", please use the option \"-f\". + + outputfile Outputfile. + +Options: -This tool is to obtain the longest trancript of each gene. + -h Print this usage page. -Option: - -e Check \"exon\" length instead of check \"CDS\" length - Note \"exon\" or \"CDS\" should exist in the 3rd column of the input file + -e Check \"exon\" length instead of check \"CDS\" length. + -f The annotation file is \"gtf\" format. 
"; + +#print @ARGV; die $usage if @ARGV!=2; -my ($file,$out) = @ARGV; -my ($Vec, $num) = @{Read_GTF($file)}; +die $usage if defined($opt_h); +my ($gff_file,$out_file)=@ARGV; + +my $format="gff"; +$format="gtf" if defined($opt_f); + +open(IN,$gff_file)||die("Error: cannot read the file: $gff_file.\n"); +my $temp_file="protein_coding.gtf"; +open(OUT,">$temp_file")||die("Error: cannot write the file: $temp_file.\n"); +my $pre_gid=""; +my $pre_tid=""; +my %genes; + +while(my $line=<IN>){ + chomp $line; + next if $line=~/^#/; + my ($chr,$source,$type,$start,$end,$score,$sym,$phase,$record)=split /\t/,$line; + if($format eq "gff"){ + my @string=split /[;|=]/,$record; + if($type eq "gene"){ + if($string[5] eq "protein_coding"){ + $pre_gid=$string[3]; + } + }else{ + if($string[5] eq $pre_gid){ + $pre_tid=$string[7]; + print OUT "$chr\t$source\t$type\t$start\t$end\t$score\t$sym\t$phase\tgene_id \"$pre_gid\"; transcript_id \"$pre_tid\";\n"; + } + } + }else{ + if($type eq "gene"){ + $record=~/(gene_id \"[^\"]+\"); (gene_type \"[^\"]+\")/; + if($2 eq "gene_type \"protein_coding\""){ + $pre_gid=$1; + } + }else{ + $record=~/(gene_id \"[^\"]+\"); (transcript_id \"[^\"]+\")/; + if($1 eq $pre_gid ){ + $pre_tid=$2; + print OUT "$chr\t$source\t$type\t$start\t$end\t$score\t$sym\t$phase\t$pre_gid; $pre_tid;\n"; + } + } + } +} +close IN; +close OUT; -open(OUT,">$out"); +my ($Vec, $num) = @{Read_GTF($temp_file)}; + +#open(OUT,">$out"); +open(OUT,">$out_file")||die("Error: cannot write the file: $out_file.\n"); foreach my $d (@{$Vec}){ my $target=$d->[0]; my $len=getLen($d->[0]); @@ -32,7 +92,7 @@ foreach my $d (@{$Vec}){ } } if($len!=0){ - OutputTran($target); + OutputTran($target); } } @@ -90,7 +150,6 @@ sub Read_GTF{ while(){ chomp; if(/^[0-9a-zA-Z]+/){ - if($begin == 1){ @temp = split("\t",$_); $temp[8]=~/(gene_id \"[^\"]+\"); (transcript_id \"[^\"]+\")/; diff --git a/lib/HUPANpTpGSLURM.pm b/lib/HUPANpTpGSLURM.pm index 1e7de5d..1ca1eb3 100644 --- a/lib/HUPANpTpGSLURM.pm +++ 
b/lib/HUPANpTpGSLURM.pm @@ -1,26 +1,86 @@ #!/usr/bin/perl package pTpG; + +sub pTpG{ use strict; use warnings; use Getopt::Std; -use vars qw($opt_e); -getopts("e"); +use vars qw($opt_h $opt_e $opt_f); +getopts("hef"); +my $usage="\nUsage: hupan pTpG [options] inputfile outputfile -sub pTpG{ -my $usage = "hupanSLURM perTranPerGene +hupan pTpG is used to obtain the longest transcript of each protein-coding genes from annotation file. +Notice: a file named \"protein_coding.gtf\" will be automatically prodcued to store all the protein-coding genes. + +Necessary input description: + + inputfile Annotation file. Default: \"gff\" format. + If the annotation file is \"gtf\", please use the option \"-f\". + + outputfile Outputfile. + +Options: -This tool is to obtain the longest trancript of each gene. + -h Print this usage page. -Option: - -e Check \"exon\" length instead of check \"CDS\" length - Note \"exon\" or \"CDS\" should exist in the 3rd column of the input file + -e Check \"exon\" length instead of check \"CDS\" length. + -f The annotation file is \"gtf\" format. 
"; + +#print @ARGV; die $usage if @ARGV!=2; -my ($file,$out) = @ARGV; -my ($Vec, $num) = @{Read_GTF($file)}; +die $usage if defined($opt_h); +my ($gff_file,$out_file)=@ARGV; + +my $format="gff"; +$format="gtf" if defined($opt_f); + +open(IN,$gff_file)||die("Error: cannot read the file: $gff_file.\n"); +my $temp_file="protein_coding.gtf"; +open(OUT,">$temp_file")||die("Error: cannot write the file: $temp_file.\n"); +my $pre_gid=""; +my $pre_tid=""; +my %genes; + +while(my $line=<IN>){ + chomp $line; + next if $line=~/^#/; + my ($chr,$source,$type,$start,$end,$score,$sym,$phase,$record)=split /\t/,$line; + if($format eq "gff"){ + my @string=split /[;|=]/,$record; + if($type eq "gene"){ + if($string[5] eq "protein_coding"){ + $pre_gid=$string[3]; + } + }else{ + if($string[5] eq $pre_gid){ + $pre_tid=$string[7]; + print OUT "$chr\t$source\t$type\t$start\t$end\t$score\t$sym\t$phase\tgene_id \"$pre_gid\"; transcript_id \"$pre_tid\";\n"; + } + } + }else{ + if($type eq "gene"){ + $record=~/(gene_id \"[^\"]+\"); (gene_type \"[^\"]+\")/; + if($2 eq "gene_type \"protein_coding\""){ + $pre_gid=$1; + } + }else{ + $record=~/(gene_id \"[^\"]+\"); (transcript_id \"[^\"]+\")/; + if($1 eq $pre_gid ){ + $pre_tid=$2; + print OUT "$chr\t$source\t$type\t$start\t$end\t$score\t$sym\t$phase\t$pre_gid; $pre_tid;\n"; + } + } + } +} +close IN; +close OUT; -open(OUT,">$out"); +my ($Vec, $num) = @{Read_GTF($temp_file)}; + +#open(OUT,">$out"); +open(OUT,">$out_file")||die("Error: cannot write the file: $out_file.\n"); foreach my $d (@{$Vec}){ my $target=$d->[0]; my $len=getLen($d->[0]); @@ -32,7 +92,7 @@ foreach my $d (@{$Vec}){ } } if($len!=0){ - OutputTran($target); + OutputTran($target); } } @@ -90,7 +150,6 @@ sub Read_GTF{ while(){ chomp; if(/^[0-9a-zA-Z]+/){ - if($begin == 1){ @temp = split("\t",$_); $temp[8]=~/(gene_id \"[^\"]+\"); (transcript_id \"[^\"]+\")/; diff --git a/lib/HUPANqualSta.pm b/lib/HUPANqualSta.pm index e73747d..142c4f9 100644 --- a/lib/HUPANqualSta.pm +++ 
b/lib/HUPANqualSta.pm @@ -5,11 +5,11 @@ sub checkQual{ use strict; use warnings; use Getopt::Std; - use vars qw($opt_h $opt_f $opt_t $opt_v); - getopts("hf:t:v:"); + use vars qw($opt_h $opt_f $opt_t); + getopts("hf:t:"); my $usage="\nUsage: hupan qualSta [options] -qualSta is used to check qualities of .fastq/.fastq.gz files on a large scale. +qualSta is used to check qualities of \".fq.gz\"/\".fastq.gz\" files on a large scale. The script will call fastqc program, so please make sure fastqc is in your PATH, or you need to use -f option to tell the program where fastqc locates. @@ -17,13 +17,13 @@ PATH, or you need to use -f option to tell the program where fastqc locates. Necessary input description: data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by .fastq or .fastq.gz. + sequencing files ended by \".fq.gz\" or \".fastq.gz\". output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -40,10 +40,7 @@ Options: program. It is recommended to set as the number of files within each sample. Pay attention that the machine should have this number of threads. - default: 1 - - -v Sets to: PE, if all files are PE data. - + Default: 1 "; die $usage if @ARGV!=2; @@ -52,10 +49,7 @@ Options: #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Detect executable fastqc @@ -130,7 +124,7 @@ To avoid overwriting of existing files. We kindly request that the push @fastq, $f; } else{ - print STDERR "Warning: $f is not a .fastq or .fastq.gz file! => Not processed!\n"; + print STDERR "Warning: $f is not a .fq.gz or .fastq.gz file! => Not processed!\n"; } } #generate commandline @@ -185,11 +179,6 @@ sub mergeFastqc{ next if $f=~/^\./; next if $f=~/\.zip$/; next if $f=~/\.html$/; - if(defined($opt_v)){ - if($opt_v eq "PE"){ - next if $f=~/^single/; - } - } my $fd=$sd."/".$f."/"."summary.txt"; open(FILE,$fd) ||die("Unable to open fastqc output file: $fd\n"); my @tmp=<FILE>; diff --git a/lib/HUPANqualStaLSF.pm b/lib/HUPANqualStaLSF.pm index ec5a929..fd55324 100644 --- a/lib/HUPANqualStaLSF.pm +++ b/lib/HUPANqualStaLSF.pm @@ -10,7 +10,7 @@ sub checkQual{ my $usage="\nUsage: hupanLSF qualSta [options] -qualSta is used to check qualities of .fastq/.fastq.gz files on a large scale. +qualSta is used to check qualities of \".fq.gz\"/\".fastq.gz\" files on a large scale. The script will call fastqc program, so please make sure fastqc is in your PATH, or you need to use -f option to tell the script where it locates. @@ -18,13 +18,13 @@ PATH, or you need to use -f option to tell the script where it locates. Necessary input description: data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by .fastq or .fastq.gz. + sequencing files ended by \".fq.gz\" or \".fastq.gz\". output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. 
It is to say, this directory will be created by the script itself. @@ -41,10 +41,10 @@ Options: program. It is recommended to set as the number of files within each sample. Pay attention that the machine should have this number of threads. - default: 1 + Default: 1 -q The queue name for job submiting. - default: default queue + Default: default queue "; die $usage if @ARGV!=2; @@ -53,9 +53,7 @@ Options: #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist."); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist."); } #Detect executable fastqc @@ -139,7 +137,7 @@ mkdir($stdout_out); push @fastq, $f; } else{ - print STDERR "Warning: $f is not a .fastq or .fastq.gz file! => Not processed!\n"; + print STDERR "Warning: $f is not a .fq.gz or .fastq.gz file! => Not processed!\n"; } } #generate commandline diff --git a/lib/HUPANqualStaSLURM.pm b/lib/HUPANqualStaSLURM.pm index 7b1398e..be1d7d7 100644 --- a/lib/HUPANqualStaSLURM.pm +++ b/lib/HUPANqualStaSLURM.pm @@ -10,7 +10,7 @@ sub checkQual{ my $usage="\nUsage: hupanSLURM qualSta [options] -qualSta is used to check qualities of .fastq/.fastq.gz files on a large scale. +qualSta is used to check qualities of \".fq.gz\"/\".fastq.gz\" files on a large scale. The script will call fastqc program, so please make sure fastqc is in your PATH, or you need to use -f option to tell the script where it locates. @@ -18,13 +18,13 @@ PATH, or you need to use -f option to tell the script where it locates. Necessary input description: data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. 
In each sub-directory, there should be several - sequencing files ended by .fastq or .fastq.gz. + sequencing files ended by \".fq.gz\" or \".fastq.gz\". output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -41,10 +41,10 @@ Options: program. It is recommended to set as the number of files within each sample. Pay attention that the machine should have this number of threads. - default: 1 + Default: 1 -q The queue name for job submiting. - default: default queue + Default: default queue "; die $usage if @ARGV!=2; @@ -53,9 +53,7 @@ Options: #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist."); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist."); } #Detect executable fastqc @@ -139,7 +137,7 @@ mkdir($stdout_out); push @fastq, $f; } else{ - print STDERR "Warning: $f is not a .fastq or .fastq.gz file! => Not processed!\n"; + print STDERR "Warning: $f is not a .fq.gz or .fastq.gz file! => Not processed!\n"; } } #generate commandline diff --git a/lib/HUPANrmContaminate.pm b/lib/HUPANrmContaminate.pm index 9775f7b..0754e8f 100644 --- a/lib/HUPANrmContaminate.pm +++ b/lib/HUPANrmContaminate.pm @@ -22,18 +22,18 @@ Necessary input description: output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. 
It is to say, this directory will be created by the script itself. Options: - -h Print this usage page. + -h Print this usage page. - -l The local alignment length. (default: 100 bp) + -l The local alignment length. (default: 100 bp) - -i The local alignment identity. (default: 90%) + -i The local alignment identity. (default: 90%) "; @@ -43,7 +43,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANrmContaminateLSF.pm b/lib/HUPANrmContaminateLSF.pm index 9a88075..9f8ebad 100644 --- a/lib/HUPANrmContaminateLSF.pm +++ b/lib/HUPANrmContaminateLSF.pm @@ -22,18 +22,18 @@ Necessary input description: output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. Options: - -h Print this usage page. + -h Print this usage page. - -l The local alignment length. (default: 100 bp) + -l The local alignment length. (default: 100 bp) - -i The local alignment identity. (default: 90%) + -i The local alignment identity. (default: 90%) "; @@ -43,7 +43,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. 
We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANrmContaminateSLURM.pm b/lib/HUPANrmContaminateSLURM.pm index addb025..dc149a8 100644 --- a/lib/HUPANrmContaminateSLURM.pm +++ b/lib/HUPANrmContaminateSLURM.pm @@ -22,18 +22,18 @@ Necessary input description: output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. Options: - -h Print this usage page. + -h Print this usage page. - -l The local alignment length. (default: 100 bp) + -l The local alignment length. (default: 100 bp) - -i The local alignment identity. (default: 90%) + -i The local alignment identity. (default: 90%) "; @@ -43,7 +43,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the \noutput directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANrmHigh.pm b/lib/HUPANrmHigh.pm index 7ac8153..d240a35 100644 --- a/lib/HUPANrmHigh.pm +++ b/lib/HUPANrmHigh.pm @@ -16,13 +16,13 @@ The script will call MUMmer program, so you need to tell the program where MUMme Necessary input description: data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. 
+ named by sample names, such as Sample1, Sample2,etc. In each sub-directory, assembly results, including - files *.scafSeq and *.contig, should exist. + files \"*.scafSeq\" and \"*.contig\", should exist. output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -32,10 +32,10 @@ Necessary input description: Reference sequence file (.fa or .fa.gz). Options: - -h Print this usage page. + -h Print this usage page. - -s Suffix of assembled file. - Defult: contigs.fa + -s Suffix of assembled file. + Defult: \"contigs.fa\" "; die $usage if @ARGV!=4; @@ -44,10 +44,7 @@ Options: #Check existence of output directory if(-e $out_dir){ -die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); +die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Detect executable nucmer and show-coords @@ -144,42 +141,41 @@ extractSeq is used to extract contigs that is lower similarity with reference ge Necessary input description: data_dir This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, assembly results, including - files *.scafSeq and *.contig, should exist. + files \"*.scafSeq\" and \"*.contig\", should exist. out_dir Results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. 
It is to say, this directory will be created by the script itself. MUMmer_output_dir This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, alignment results, including - files *.coords, should exist. + files \"*.coords\", should exist. Options: - -h Print this usage page. + -h Print this usage page. - -i The theshold of identity . + -i The theshold of identity. + Default: 0.95. - -c The theshold of query coverage . + -c The theshold of query coverage. + Default: 0.95. - -s Suffix of assembled file. - Defult: contigs.fa + -s Suffix of assembled file. + Defult: \"contigs.fa\" "; die $usage if @ARGV!=3; die $usage if defined($opt_h); my ($contig_dir,$out_dir,$coords_dir)=@ARGV; -#Check existence of output directory -if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); -} + #Check existence of output directory + if(-e $out_dir){ + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); + } #get identity my $identity=0.95; @@ -251,7 +247,7 @@ foreach my $s (@sample){ while(){ chomp; if($_=~/^>/){ - my @t=split / /,$_; + my @t=split /\s+/,$_; my $name=substr($t[0],1,length($t[0])-1); $names{$name}=1; } @@ -273,7 +269,8 @@ foreach my $s (@sample){ } else{ chomp; - my @t=split ' ',$_; + $_=~s/^ +//; + my @t=split /\s+/,$_; #print @t; my $i=$t[9]; my $c=$t[15]; @@ -287,7 +284,7 @@ foreach my $s (@sample){ } close FILE2; my $length2=keys %contigs; - print "There are: ".$length2." contigs highly similarity with the reference genome with >= ".($identity*100)."% identity and >= ".($coverage*100)."% coverage .\n"; + print "There are: ".$length2." 
contigs highly similarity with the reference genome with >= ".($identity*100)."% identity and >= ".($coverage*100)."% coverage.\n"; #Remove the contigs name of high similarity with reference genome foreach my $key (keys %names){ @@ -306,7 +303,7 @@ foreach my $s (@sample){ while(my $line=){ chomp $line; if($line=~/^>/){ - my @t=split / /,$line; + my @t=split /\s+/,$line; my $name=substr($t[0],1,length($t[0])-1); if(exists $names{$name}){ print FILE4 $line."\n"; diff --git a/lib/HUPANrmHighLSF.pm b/lib/HUPANrmHighLSF.pm index a242dbd..29c4993 100644 --- a/lib/HUPANrmHighLSF.pm +++ b/lib/HUPANrmHighLSF.pm @@ -16,13 +16,13 @@ The script will call MUMmer program, so you need to tell the program where MUMme Necessary input description: data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, assembly results, including - files *.scafSeq and *.contig, should exist. + files \"*.scafSeq\" and \"*.contig\", should exist. output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -32,13 +32,13 @@ Necessary input description: Reference sequence file (.fa or .fa.gz). Options: - -h Print this usage page. + -h Print this usage page. - -q The queue name for job submiting. - Default: default queue + -q The queue name for job submiting. + Default: default queue - -s Suffix of assembled file. - Defult: contigs.fa + -s Suffix of assembled file. + Defult: contigs.fa "; die $usage if @ARGV!=4; @@ -47,10 +47,7 @@ Options: #Check existence of output directory if(-e $out_dir){ -die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. 
We kindly request that the - output directory should not exist. -"); +die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Detect executable nucmer and show-coords @@ -160,42 +157,41 @@ extractSeq is used to extract contigs that is lower similarity with reference ge Necessary input description: data_dir This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, assembly results, including - files *.scafSeq and *.contig, should exist. + files \"*.scafSeq\" and \"*.contig\", should exist. out_dir Results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. MUMmer_output_dir This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, alignment results, including files *.coords, should exist. Options: - -h Print this usage page. + -h Print this usage page. + + -i The theshold of identity. + Default: 0.95 - -i The theshold of identity . - - -c The theshold of query coverage . - - -s Suffix of assembled file. - Defult: contigs.fa + -c The theshold of query coverage. + Default: 0.95 + + -s Suffix of assembled file. + Defult: \"contigs.fa\" "; die $usage if @ARGV!=3; die $usage if defined($opt_h); my ($contig_dir,$out_dir,$coords_dir)=@ARGV; -#Check existence of output directory -if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. 
-"); -} + #Check existence of output directory + if(-e $out_dir){ + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); + } #get identity my $identity=0.95; @@ -267,7 +263,7 @@ foreach my $s (@sample){ while(){ chomp; if($_=~/^>/){ - my @t=split / /,$_; + my @t=split /\s+/,$_; my $name=substr($t[0],1,length($t[0])-1); $names{$name}=1; } @@ -289,7 +285,8 @@ foreach my $s (@sample){ } else{ chomp; - my @t=split ' ',$_; + $_=~s/^ +//; + my @t=split /\s+/,$_; #print @t; my $i=$t[9]; my $c=$t[15]; @@ -322,7 +319,7 @@ foreach my $s (@sample){ while(my $line=){ chomp $line; if($line=~/^>/){ - my @t=split / /,$line; + my @t=split /\s+/,$line; my $name=substr($t[0],1,length($t[0])-1); if(exists $names{$name}){ print FILE4 $line."\n"; diff --git a/lib/HUPANrmHighSLURM.pm b/lib/HUPANrmHighSLURM.pm index 34530b5..03a720e 100644 --- a/lib/HUPANrmHighSLURM.pm +++ b/lib/HUPANrmHighSLURM.pm @@ -16,13 +16,13 @@ The script will call MUMmer program, so you need to tell the program where MUMme Necessary input description: data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, assembly results, including - files *.scafSeq and *.contig, should exist. + files \"*.scafSeq\" and \"*.contig\", should exist. output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -32,16 +32,16 @@ Necessary input description: Reference sequence file (.fa or .fa.gz). Options: - -h Print this usage page. + -h Print this usage page. - -q The queue name for job submiting. 
- Default: default queue + -q The queue name for job submiting. + Default: default queue - -t Threads used. - Default: 1. + -t Threads used. + Default: 1. - -s Suffix of assembled file. - Defult: contigs.fa + -s Suffix of assembled file. + Defult: \"contigs.fa\" "; die $usage if @ARGV!=4; @@ -51,7 +51,7 @@ Options: #Check existence of output directory if(-e $out_dir){ die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the +To avoid overwriting of existing files, we kindly request that the output directory should not exist. "); } @@ -171,30 +171,32 @@ extractSeq is used to extract contigs that is lower similarity with reference ge Necessary input description: data_dir This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, assembly results, including files *.scafSeq and *.contigs.fa, should exist. out_dir Results will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. MUMmer_output_dir This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, alignment results, including files *.coords, should exist. Options: - -h Print this usage page. + -h Print this usage page. - -i The theshold of identity . + -i The theshold of identity. + Default: 0.95. - -c The theshold of query coverage . + -c The theshold of query coverage. + Default: 0.95. - -s Suffix of assembled file. - Defult: contigs.fa + -s Suffix of assembled file. 
+ Defult: \"contigs.fa\" "; die $usage if @ARGV!=3; die $usage if defined($opt_h); @@ -202,10 +204,7 @@ Options: #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #get identity @@ -278,7 +277,7 @@ foreach my $s (@sample){ while(){ chomp; if($_=~/^>/){ - my @t=split / /,$_; + my @t=split /\s+/,$_; my $name=substr($t[0],1,length($t[0])-1); $names{$name}=1; } @@ -300,7 +299,8 @@ foreach my $s (@sample){ } else{ chomp; - my @t=split ' ',$_; + $_=~s/^ +//; + my @t=split /\s+/,$_; #print @t; my $i=$t[9]; my $c=$t[15]; @@ -333,7 +333,7 @@ foreach my $s (@sample){ while(my $line=){ chomp $line; if($line=~/^>/){ - my @t=split / /,$line; + my @t=split /\s+/,$line; my $name=substr($t[0],1,length($t[0])-1); if(exists $names{$name}){ print FILE4 $line."\n"; diff --git a/lib/HUPANrmRdt.pm b/lib/HUPANrmRdt.pm index 6d5e9ae..6ef455d 100644 --- a/lib/HUPANrmRdt.pm +++ b/lib/HUPANrmRdt.pm @@ -69,7 +69,7 @@ die(" $exe doesn't exist! if(-e $out_dir){ die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the +To avoid overwriting of existing files, we kindly request that the output directory should not exist. "); } @@ -175,7 +175,7 @@ die("Executable blastCluster.pl cannot be found in your PATH!\n die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the +To avoid overwriting of existing files, we kindly request that the output directory should not exist. 
") if -e $out_dir; diff --git a/lib/HUPANrmRdtLSF.pm b/lib/HUPANrmRdtLSF.pm index e592112..8ef92a0 100644 --- a/lib/HUPANrmRdtLSF.pm +++ b/lib/HUPANrmRdtLSF.pm @@ -71,7 +71,7 @@ die(" $exe doesn't exist! if(-e $out_dir){ die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the +To avoid overwriting of existing files, we kindly request that the output directory should not exist. "); } @@ -195,7 +195,7 @@ die("Executable blastCluster.pl cannot be found in your PATH!\n die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the +To avoid overwriting of existing files, we kindly request that the output directory should not exist. ") if -e $out_dir; diff --git a/lib/HUPANrmRdtSLURM.pm b/lib/HUPANrmRdtSLURM.pm index 90e6f8e..431418a 100644 --- a/lib/HUPANrmRdtSLURM.pm +++ b/lib/HUPANrmRdtSLURM.pm @@ -71,7 +71,7 @@ die(" $exe doesn't exist! if(-e $out_dir){ die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the +To avoid overwriting of existing files, we kindly request that the output directory should not exist. "); } @@ -196,7 +196,7 @@ die("Executable blastCluster.pl cannot be found in your PATH!\n die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the +To avoid overwriting of existing files, we kindly request that the output directory should not exist. ") if -e $out_dir; diff --git a/lib/HUPANsamToBam.pm b/lib/HUPANsamToBam.pm index ea2e8ab..897bbac 100644 --- a/lib/HUPANsamToBam.pm +++ b/lib/HUPANsamToBam.pm @@ -17,22 +17,22 @@ The script will call samtools program, so the directory where samtools locates i Necessary input description: mapping_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. 
+ named by sample names, such as Sample1, Sample2,etc. In each sub-directory, One or more mapping results, - *.sam, should exist. + \"*.sam\", should exist. - output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + output_directory Results will be output to this directory.To avoid + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - QUAST_directory samtools directory where executable samtools locates. + QUAST_directory samtools directory where executable samtools locates. Options: -h Print this usage page. - -t Threads used. + -t Threads used. Default: 1 "; @@ -48,10 +48,7 @@ die("Error01: Cannot find samtools file in directory $samtools_dir\n") unless(-e #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads diff --git a/lib/HUPANsamToBamLSF.pm b/lib/HUPANsamToBamLSF.pm index 5c1090b..d636409 100644 --- a/lib/HUPANsamToBamLSF.pm +++ b/lib/HUPANsamToBamLSF.pm @@ -17,26 +17,26 @@ The script will call samtools program, so the directory where samtools locates i Necessary input description: mapping_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. - In each sub-directory, One or more mapping results, - *.sam, should exist. + named by sample names, such as Sample1, Sample2,etc. + In each sub-directory, one or more mapping results, + \"*.sam\", should exist. - output_directory Results will be output to this directory.To avoid - overwriting of existing files. 
We kindly request + output_directory Results will be output to this directory.To avoid + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - QUAST_directory samtools directory where executable samtools locates. + QUAST_directory samtools directory where executable samtools locates. Options: -h Print this usage page. - -t Threads used. + -t Threads used. Default: 1 - -q The queue name for job submiting. - default: default queue + -q The queue name for job submiting. + Default: default queue "; die $usage if @ARGV!=3; @@ -50,9 +50,7 @@ die("Error01: Cannot find samtools file in directory $samtools_dir\n") unless(-e #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist."); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads diff --git a/lib/HUPANsamToBamSLURM.pm b/lib/HUPANsamToBamSLURM.pm index 3bc85d1..27c8463 100644 --- a/lib/HUPANsamToBamSLURM.pm +++ b/lib/HUPANsamToBamSLURM.pm @@ -17,26 +17,26 @@ The script will call samtools program, so the directory where samtools locates i Necessary input description: mapping_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, One or more mapping results, - *.sam, should exist. + \"*.sam\", should exist. - output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + output_directory Results will be output to this directory.To avoid + overwriting of existing files, we kindly request that the output_directory should not exist. 
It is to say, this directory will be created by the script itself. - QUAST_directory samtools directory where executable samtools locates. + QUAST_directory samtools directory where executable samtools locates. Options: -h Print this usage page. - -t Threads used. + -t Threads used. Default: 1 - -q The queue name for job submiting. - default: default queue + -q The queue name for job submiting. + Default: default queue "; die $usage if @ARGV!=3; @@ -50,9 +50,7 @@ die("Error01: Cannot find samtools file in directory $samtools_dir\n") unless(-e #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist."); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Read threads diff --git a/lib/HUPANsim.pm b/lib/HUPANsim.pm index b3dad7f..1b95595 100644 --- a/lib/HUPANsim.pm +++ b/lib/HUPANsim.pm @@ -15,11 +15,11 @@ sim is used to simulate the size of pan-genome and core-genome from gene presenc Necessary input description: - data_path This path leads to gene.exist or geneFam.exist + data_path This path leads to gene.exist or geneFam.exist output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -29,7 +29,7 @@ Options: -n Specifies the number of random sampling times for simulation. - default: 100 + Default: 100 "; @@ -39,10 +39,7 @@ Options: #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. 
We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #get simulation number diff --git a/lib/HUPANsimLSF.pm b/lib/HUPANsimLSF.pm index 60a26db..390cdd1 100644 --- a/lib/HUPANsimLSF.pm +++ b/lib/HUPANsimLSF.pm @@ -15,11 +15,11 @@ sim is used to simulate the size of pan-genome and core-genome from gene presenc Necessary input description: - data_path This path leads to gene.exist or geneFam.exist + data_path This path leads to gene.exist or geneFam.exist output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -29,7 +29,7 @@ Options: -n Specifies the number of random sampling times for simulation. - default: 100 + Default: 100 "; @@ -39,10 +39,7 @@ Options: #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #get simulation number diff --git a/lib/HUPANsimSLURM.pm b/lib/HUPANsimSLURM.pm index f99ce53..277e338 100644 --- a/lib/HUPANsimSLURM.pm +++ b/lib/HUPANsimSLURM.pm @@ -15,11 +15,11 @@ sim is used to simulate the size of pan-genome and core-genome from gene presenc Necessary input description: - data_path This path leads to gene.exist or geneFam.exist + data_path This path leads to gene.exist or geneFam.exist output_directory Both final output files and intermediate results will be found in this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -29,7 +29,7 @@ Options: -n Specifies the number of random sampling times for simulation. - default: 100 + Default: 100 "; @@ -39,10 +39,7 @@ Options: #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #get simulation number diff --git a/lib/HUPANsimSeq.pm b/lib/HUPANsimSeq.pm index 85ff978..52cc6c5 100644 --- a/lib/HUPANsimSeq.pm +++ b/lib/HUPANsimSeq.pm @@ -43,15 +43,15 @@ Necessary input description: data_path This directory should contain many sub-directories named by sample names, such as sample1, sample2,etc. In each sub-directory, there should be novel sequences - file ended by .contig. + file ended by \".contig\". out_path Results will be output to this directory. To avoid - overwriting of existing files. 
We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - cdhit_directory directory where cdhit-est locates. + cdhit_directory Directory where cd-hit-est locates. Options: -h Print this usage. @@ -73,7 +73,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #get simulation number @@ -172,8 +172,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We -kindly request that the\n output directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the\n output directory should not exist.\n"); } $out_dir.="/" unless $out_dir=~/\/$/; mkdir $out_dir; diff --git a/lib/HUPANsimSeqLSF.pm b/lib/HUPANsimSeqLSF.pm index 3f1e6a7..fa1a01c 100644 --- a/lib/HUPANsimSeqLSF.pm +++ b/lib/HUPANsimSeqLSF.pm @@ -43,15 +43,15 @@ Necessary input description: data_path This directory should contain many sub-directories named by sample names, such as sample1, sample2,etc. In each sub-directory, there should be novel sequences - file ended by .contig. + file ended by \".contig\". out_path Results will be output to this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - cdhit_directory directory where cdhit-est locates. 
+ cdhit_directory Directory where cd-hit-est locates. Options: -h Print this usage. @@ -73,7 +73,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #get simulation number @@ -196,8 +196,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We -kindly request that the\n output directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the\n output directory should not exist.\n"); } $out_dir.="/" unless $out_dir=~/\/$/; mkdir $out_dir; diff --git a/lib/HUPANsimSeqSLURM.pm b/lib/HUPANsimSeqSLURM.pm index 76054bb..74d83d2 100644 --- a/lib/HUPANsimSeqSLURM.pm +++ b/lib/HUPANsimSeqSLURM.pm @@ -43,15 +43,15 @@ Necessary input description: data_path This directory should contain many sub-directories named by sample names, such as sample1, sample2,etc. In each sub-directory, there should be novel sequences - file ended by .contig. + file ended by \".contig\". out_path Results will be output to this directory. To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. - cdhit_directory directory where cdhit-est locates. + cdhit_directory Directory where cd-hit-est locates. Options: -h Print this usage. 
@@ -73,7 +73,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #get simulation number @@ -197,7 +197,7 @@ Options: #check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files. We + die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of existing files, we kindly request that the\n output directory should not exist.\n"); } $out_dir.="/" unless $out_dir=~/\/$/; diff --git a/lib/HUPANsplitSeq.pm b/lib/HUPANsplitSeq.pm index 2527898..4cb4d44 100644 --- a/lib/HUPANsplitSeq.pm +++ b/lib/HUPANsplitSeq.pm @@ -32,8 +32,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of - existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANsplitSeqLSF.pm b/lib/HUPANsplitSeqLSF.pm index 075bd89..b7bbc6f 100644 --- a/lib/HUPANsplitSeqLSF.pm +++ b/lib/HUPANsplitSeqLSF.pm @@ -32,8 +32,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of - existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANsplitSeqSLURM.pm b/lib/HUPANsplitSeqSLURM.pm index 85b5c0d..030be9a 100644 --- a/lib/HUPANsplitSeqSLURM.pm +++ b/lib/HUPANsplitSeqSLURM.pm @@ -32,8 +32,7 @@ Options: #check existense of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists.\nTo avoid overwriting of - existing files. We kindly request that the \noutput directory should not exist.\n"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #adjust directory names and create output directory diff --git a/lib/HUPANtrim.pm b/lib/HUPANtrim.pm index 2be01e6..d7a1244 100644 --- a/lib/HUPANtrim.pm +++ b/lib/HUPANtrim.pm @@ -19,12 +19,12 @@ given to the script as a necessary input. Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by .fastq(or .fq) or .fastq.gz(or .fq.gz). + sequencing files ended by \".fq.gz\" or \".fastq.gz\". output_directory High-quality reads will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -34,21 +34,22 @@ Necessary input description: Options: -h Print this usage page. - -t thread number + -t Thread number. + Default: 1 -a Adaptor file in fasta utilized by trimmomatic program. Default: trimmomatoc_dir/adapters/TruSeq3-PE-2.fa -s Suffix of the fastq_file. Check your sequencing data and change it if needed.
- Default: \".fq.gz\" + Default: \".fastq.gz\" -k Linker for paired_end identifer. Paired-end fastq file should end with *1suffix or *2suffix, where suffix is - \".fq.gz\"( or \".fastq\", etc. See -s option) and * is the + \".fq.gz\"( or \".fastq.gz\", etc. See -s option) and * is the linker such as \"_\".As an example, the file should - be like CX123_1.fq.gz (linker is \"_\", suffix is \".fq.gz\") - or BX125_R1.fastq(linker is \"_R\", suffix is \".fastq\") + be like Sample1_1.fq.gz (linker is \"_\", suffix is \".fq.gz\") + or Sample2_R1.fastq.gz(linker is \"_R\", suffix is \".fastq.gz\") Default: \"_\" -p <33 or 64> Quality score version. @@ -83,10 +84,7 @@ my ($data_dir,$out_dir,$trim_dir)=@ARGV; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Detect executable fastqc @@ -124,7 +122,7 @@ if(defined($opt_a)){ die("Error: unable to find trimmomatic adaptor file: $trim_adaptor\n") unless(-e $trim_adaptor); #read fastq suffix -my $suffix=".fq.gz"; +my $suffix=".fastq.gz"; if(defined($opt_s)){ $suffix=$opt_s; } diff --git a/lib/HUPANtrimLSF.pm b/lib/HUPANtrimLSF.pm index 82020d0..50f4e15 100644 --- a/lib/HUPANtrimLSF.pm +++ b/lib/HUPANtrimLSF.pm @@ -19,12 +19,12 @@ given to the script as a necessary input. Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by .fastq(or .fq) or .fastq.gz(or .fq.gz). + sequencing files ended by \".fq.gz\" or \".fastq.gz\".
output_directory High-quality reads will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -34,22 +34,24 @@ Necessary input description: Options: -h Print this usage page. - -t thread number + -t Thread number. + Default: 1 -a Adaptor file in fasta utilized by trimmomatic program. Default: trimmomatoc_dir/adapters/TruSeq3-PE-2.fa - + -s Suffix of the fastq_file. Check your sequencing data and change it if needed. - Default: \".fq.gz\" + Default: \".fastq.gz\" -k Linker for paired_end identifer. Paired-end fastq file should end with *1suffix or *2suffix, where suffix is - \".fq.gz\"( or \".fastq\", etc. See -s option) and * is the + \".fq.gz\"( or \".fastq.gz\", etc. See -s option) and * is the linker such as \"_\".As an example, the file should - be like CX123_1.fq.gz (linker is \"_\", suffix is \".fq.gz\") - or BX125_R1.fastq(linker is \"_R\", suffix is \".fastq\") + be like Sample1_1.fq.gz (linker is \"_\", suffix is \".fq.gz\") + or Sample2_R1.fastq.gz(linker is \"_R\", suffix is \".fastq.gz\") Default: \"_\" + -p <33 or 64> Quality score version. Default: 33 (phred+33) @@ -84,10 +86,7 @@ my ($data_dir,$out_dir,$trim_dir)=@ARGV; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists.
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Detect executable fastqc diff --git a/lib/HUPANtrimSLURM.pm b/lib/HUPANtrimSLURM.pm index daf89fc..1aeb6c1 100644 --- a/lib/HUPANtrimSLURM.pm +++ b/lib/HUPANtrimSLURM.pm @@ -1,5 +1,6 @@ #!/usr/bin/perl #Created by Hu Zhiqiang, 2014-7-2 + package trim; sub trimFastq{ use strict; @@ -19,12 +20,12 @@ given to the script as a necessary input. Necessary input description: fastq_data_directory This directory should contain many sub-directories - named by sample names, such as CX101, B152,etc. + named by sample names, such as Sample1, Sample2,etc. In each sub-directory, there should be several - sequencing files ended by .fastq(or .fq) or .fastq.gz(or .fq.gz). + sequencing files ended by \".fq.gz\" or \".fastq.gz\". output_directory High-quality reads will be output to this directory. - To avoid overwriting of existing files. We kindly request + To avoid overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -34,22 +35,24 @@ Necessary input description: Options: -h Print this usage page. - -t thread number + -t Thread number. + Default: 1 -a Adaptor file in fasta utilized by trimmomatic program. Default: trimmomatoc_dir/adapters/TruSeq3-PE-2.fa -s Suffix of the fastq_file. Check your sequencing data and change it if needed. - Default: \".fq.gz\" + Default: \".fastq.gz\" -k Linker for paired_end identifer. Paired-end fastq file should end with *1suffix or *2suffix, where suffix is - \".fq.gz\"( or \".fastq\", etc. See -s option) and * is the + \".fq.gz\"( or \".fastq.gz\", etc. 
See -s option) and * is the linker such as \"_\".As an example, the file should - be like CX123_1.fq.gz (linker is \"_\", suffix is \".fq.gz\") - or BX125_R1.fastq(linker is \"_R\", suffix is \".fastq\") + be like Sample1_1.fq.gz (linker is \"_\", suffix is \".fq.gz\") + or Sample2_R1.fastq.gz(linker is \"_R\", suffix is \".fastq.gz\") Default: \"_\" + -p <33 or 64> Quality score version. Default: 33 (phred+33) @@ -84,10 +87,7 @@ my ($data_dir,$out_dir,$trim_dir)=@ARGV; #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } #Detect executable fastqc diff --git a/lib/HUPANunalnCtg.pm b/lib/HUPANunalnCtg.pm index 9dffd5c..b541bea 100644 --- a/lib/HUPANunalnCtg.pm +++ b/lib/HUPANunalnCtg.pm @@ -9,8 +9,8 @@ sub getUnaln{ use strict; use warnings; use Getopt::Std; -use vars qw($opt_h $opt_p); -getopts("hp:"); +use vars qw($opt_h $opt_s); +getopts("hs:"); my $usage="\nUsage: hupan getUnalnCtg [options] @@ -21,15 +21,15 @@ Necessary input description: assembly_directory This directory should contain many sub-directories named by sample names, such as sample1, sample2,etc. In each sub-directory, assembly results, including - file *.contig, should exist. + file \"*.contig.gz\", should exist. QUAST_assess_directory This directory should contain many sub-directories named by sample names, such as sample1, sample2,etc. In each sub-directory, quast assessment, including - directory file contigs_reports, should exist. + directory file \"contigs_reports\", should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files.
We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -37,7 +37,8 @@ Necessary input description: Options: -h Print this usage page. - -p The suffix of contigs file in assembly directory. + -s The suffix of contigs file in assembly directory. + Default: \".contig.gz\" "; @@ -66,15 +67,12 @@ $quast_dir.="/" unless($quast_dir=~/\/$/); $out_dir.="/" unless($out_dir=~/\/$/); #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } my $thread_num=1; my $suffix=".contig.gz"; -$suffix=$opt_p if defined $opt_p; +$suffix=$opt_s if defined $opt_s; #Create output directory and sub-directories mkdir($out_dir); my $out_data=$out_dir."data/"; diff --git a/lib/HUPANunalnCtgLSF.pm b/lib/HUPANunalnCtgLSF.pm index 26509c9..85f6f6c 100644 --- a/lib/HUPANunalnCtgLSF.pm +++ b/lib/HUPANunalnCtgLSF.pm @@ -9,8 +9,8 @@ sub getUnaln{ use strict; use warnings; use Getopt::Std; -use vars qw($opt_h $opt_p); -getopts("hp:"); +use vars qw($opt_h $opt_s); +getopts("hs:"); my $usage="\nUsage: hupanLSF getUnalnCtg [options] @@ -21,15 +21,15 @@ Necessary input description: assembly_directory This directory should contain many sub-directories named by sample names, such as sample1, sample2,etc. In each sub-directory, assembly results, including - file *.contig, should exist. + file \"*.contig.gz\", should exist. QUAST_assess_directory This directory should contain many sub-directories named by sample names, such as sample1, sample2,etc. In each sub-directory, quast assessment, including - directory file contigs_reports, should exist. 
+ directory file \"contigs_reports\", should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -37,8 +37,8 @@ Necessary input description: Options: -h Print this usage page. - -p The suffix of contigs file in assembly directory. - default: .contig.gz + -s The suffix of contigs file in assembly directory. + Default: \".contig.gz\" "; @@ -66,15 +66,12 @@ $quast_dir.="/" unless($quast_dir=~/\/$/); $out_dir.="/" unless($out_dir=~/\/$/); #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } my $thread_num=1; my $suffix=".contig.gz"; -$suffix=$opt_p if defined $opt_p; +$suffix=$opt_s if defined $opt_s; #Create output directory and sub-directories mkdir($out_dir); my $out_data=$out_dir."data/"; @@ -205,7 +202,7 @@ Necessary input description: *.partially.contig and *.partially.coords, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -224,10 +221,7 @@ $out.="/" unless($out=~/\/$/); ##Check existence of output directory if(-e $out){ - die("Error: output directory \"$out\" already exists. - To avoid overwriting of existing files. We kindly request that the - output directory should not exist. - "); + die("Error: output directory \"$out\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } mkdir($out); diff --git a/lib/HUPANunalnCtgSLURM.pm b/lib/HUPANunalnCtgSLURM.pm index 5a72f52..345fd04 100644 --- a/lib/HUPANunalnCtgSLURM.pm +++ b/lib/HUPANunalnCtgSLURM.pm @@ -9,8 +9,8 @@ sub getUnaln{ use strict; use warnings; use Getopt::Std; -use vars qw($opt_h $opt_p); -getopts("hp:"); +use vars qw($opt_h $opt_s); +getopts("hs:"); my $usage="\nUsage: hupanSLURM getUnalnCtg [options] @@ -21,15 +21,15 @@ Necessary input description: assembly_directory This directory should contain many sub-directories named by sample names, such as sample1, sample2,etc. In each sub-directory, assembly results, including - file *.contig, should exist. + file \"*.contig.gz\", should exist. QUAST_assess_directory This directory should contain many sub-directories named by sample names, such as sample1, sample2,etc. In each sub-directory, quast assessment, including - directory file contigs_reports, should exist. + directory file \"contigs_reports\", should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -37,8 +37,8 @@ Necessary input description: Options: -h Print this usage page. - -p The suffix of contigs file in assembly directory. - default: .contig.gz + -s The suffix of contigs file in assembly directory. + Default: \".contig.gz\" "; @@ -66,15 +66,12 @@ $quast_dir.="/" unless($quast_dir=~/\/$/); $out_dir.="/" unless($out_dir=~/\/$/); #Check existence of output directory if(-e $out_dir){ - die("Error: output directory \"$out_dir\" already exists. -To avoid overwriting of existing files. We kindly request that the - output directory should not exist. -"); + die("Error: output directory \"$out_dir\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } my $thread_num=1; my $suffix=".contig.gz"; -$suffix=$opt_p if defined $opt_p; +$suffix=$opt_s if defined $opt_s; #Create output directory and sub-directories mkdir($out_dir); my $out_data=$out_dir."data/"; @@ -207,7 +204,7 @@ Necessary input description: *.partially.contig and *.partially.coords, should exist. output_directory Results will be output to this directory.To avoid - overwriting of existing files. We kindly request + overwriting of existing files, we kindly request that the output_directory should not exist. It is to say, this directory will be created by the script itself. @@ -226,10 +223,7 @@ $out.="/" unless($out=~/\/$/); ##Check existence of output directory if(-e $out){ - die("Error: output directory \"$out\" already exists. - To avoid overwriting of existing files. We kindly request that the - output directory should not exist. - "); + die("Error: output directory \"$out\" already exists. 
To avoid overwriting of existing files, we kindly request that the output directory should not exist.\n"); } mkdir($out); diff --git a/src/R/Makefile b/src/R/Makefile index 8260234..695eb75 100644 --- a/src/R/Makefile +++ b/src/R/Makefile @@ -1,6 +1,6 @@ export BINDIR = ../../bin install: - @cp plotNovelSeq.R ${BINDIR};\ + @cp pav_plot.R ${BINDIR};\ cp plotNovelSeq.R ${BINDIR};\ cp plotQual.R ${BINDIR} diff --git a/src/bam2cov/Makefile b/src/bam2cov/Makefile index f2cbe64..95d710c 100644 --- a/src/bam2cov/Makefile +++ b/src/bam2cov/Makefile @@ -1,6 +1,6 @@ ../../bin/bam2cov: - g++ -O2 -o ../../bin/bam2cov bam2cov.cpp -lbamtools -L ../../lib/ -I ../../lib/include/ + g++ -O2 -o ../../bin/bam2cov bam2cov.cpp -lbamtools -L ../../lib/ -I ../../lib/include/ -D_GLIBCXX_USE_CXX11_ABI=0 clean: rm ../../bin/bam2cov debug: - g++ -g -O2 -o ../../bin/bam2cov bam2cov.cpp -lbamtools -L ../../lib/ -I ../../lib/include/ + g++ -g -O2 -o ../../bin/bam2cov bam2cov.cpp -lbamtools -L ../../lib/ -I ../../lib/include/ -D_GLIBCXX_USE_CXX11_ABI=0 diff --git a/src/ccov/CCOV.cpp b/src/ccov/CCOV.cpp index 21a0742..d615849 100644 --- a/src/ccov/CCOV.cpp +++ b/src/ccov/CCOV.cpp @@ -231,10 +231,10 @@ void checkGeneCoverage(map > &mapper, GTF >f){ // cerr << i <<"\t**"<chr].size()<chr][i-1]; + // alength+=mapper[giter->chr][i-1]; if(mapper[giter->chr][i-1]>0){ - slength++; - alength+=mapper[giter->chr][i-1]; + slength++; + alength+=mapper[giter->chr][i-1]; } } giter->geneDep=(double)alength/(double)tlength; diff --git a/src/ccov/Makefile b/src/ccov/Makefile index ef7f6ee..b0d7b5f 100644 --- a/src/ccov/Makefile +++ b/src/ccov/Makefile @@ -1,6 +1,6 @@ ../../bin/ccov: - g++ -O2 -o ../../bin/ccov CCOV.cpp GTF.cpp -lz -lbamtools -L ../../lib/ -I ../../lib/include/ + g++ -O2 -o ../../bin/ccov CCOV.cpp GTF.cpp -lz -lbamtools -L ../../lib/ -I ../../lib/include/ -D_GLIBCXX_USE_CXX11_ABI=0 clean: rm ../../bin/ccov debug: - g++ -g -o ../../bin/ccov CCOV.cpp GTF.cpp -lz -lbamtools -L ../../lib/ -I 
../../lib/include/ + g++ -g -o ../../bin/ccov CCOV.cpp GTF.cpp -lz -lbamtools -L ../../lib/ -I ../../lib/include/ -D_GLIBCXX_USE_CXX11_ABI=0 diff --git a/src/perl/Makefile b/src/perl/Makefile index e1cedb0..0fb9e54 100644 --- a/src/perl/Makefile +++ b/src/perl/Makefile @@ -14,7 +14,7 @@ install: clean: @rm ${BINDIR}/hupan;\ - rm ${BINDIR}/hupanSLF;\ + rm ${BINDIR}/hupanLSF;\ rm ${BINDIR}/hupanSLURM;\ rm ${BINDIR}/adjCdhit.pl;\ rm ${BINDIR}/blastCluster.pl;\ diff --git a/src/perl/getTaxClass.pl b/src/perl/getTaxClass.pl index 234e5e7..10c108c 100755 --- a/src/perl/getTaxClass.pl +++ b/src/perl/getTaxClass.pl @@ -157,7 +157,7 @@ open(OUT,">$out_file")||die("Error10: cannot write the table: $out_file.\n"); print OUT "accesion\taccession.version\ttaxid\tsource\n"; while((my $key, my $value)=each %accession){ - print $key."\n"; + #print $key."\n"; my $taxonomy=$acc_tax{$key}; my $records; if(exists $source{$taxonomy}){ diff --git a/src/perl/getUnalnCtg.pl b/src/perl/getUnalnCtg.pl index a19fcbc..35cb7d8 100755 --- a/src/perl/getUnalnCtg.pl +++ b/src/perl/getUnalnCtg.pl @@ -33,7 +33,7 @@ "; -die $usage if @ARGV!=4; +die $usage if @ARGV!=5; die $usage if defined($opt_h); my ($contig_file,$info_file,$coords_file,$output_prefix,$prefix)=@ARGV; diff --git a/src/perl/linearK.pl b/src/perl/linearK.pl index ff7eaec..3317eff 100755 --- a/src/perl/linearK.pl +++ b/src/perl/linearK.pl @@ -15,7 +15,7 @@ Necessary input description: data_directory This directory should contain one or more pair of FASTQ files - with suffix of .fq or .fq.gz. The suffix can be changed with -s option. + with suffix of .fastq.gz or .fq.gz. The suffix can be changed with -s option. output_directory The output directory. @@ -29,10 +29,10 @@ Default: 1 -g Genome size. Used to infer sequencing depth. - Default: 380000000 (460M) + Default: 3000000000 (3Gb) -s Suffix of files within data_directory. - Default: .fq.gz + Default: .fastq.gz -r Parameters of linear function: Kmer=2*int(0.5*(a*Depth+b))+1. 
The parameter should be input as \"a,b\". @@ -98,11 +98,11 @@ $thread_num=$opt_t if defined($opt_t); #define file suffix -my $suffix=".fq.gz"; +my $suffix=".fastq.gz"; $suffix=$opt_s if defined($opt_s); #define genome size -my $gsize=380000000; +my $gsize=3000000000; $gsize=$opt_g if defined($opt_g); #define linear function of Kmer diff --git a/tools/MUMmer3.23/annotate b/tools/MUMmer3.23/annotate deleted file mode 100755 index 15f5577..0000000 Binary files a/tools/MUMmer3.23/annotate and /dev/null differ diff --git a/tools/MUMmer3.23/aux_bin/postnuc b/tools/MUMmer3.23/aux_bin/postnuc deleted file mode 100755 index dec8628..0000000 Binary files a/tools/MUMmer3.23/aux_bin/postnuc and /dev/null differ diff --git a/tools/MUMmer3.23/aux_bin/postpro b/tools/MUMmer3.23/aux_bin/postpro deleted file mode 100755 index 6c60deb..0000000 Binary files a/tools/MUMmer3.23/aux_bin/postpro and /dev/null differ diff --git a/tools/MUMmer3.23/aux_bin/prenuc b/tools/MUMmer3.23/aux_bin/prenuc deleted file mode 100755 index 042c3fb..0000000 Binary files a/tools/MUMmer3.23/aux_bin/prenuc and /dev/null differ diff --git a/tools/MUMmer3.23/aux_bin/prepro b/tools/MUMmer3.23/aux_bin/prepro deleted file mode 100755 index c44f858..0000000 Binary files a/tools/MUMmer3.23/aux_bin/prepro and /dev/null differ diff --git a/tools/MUMmer3.23/combineMUMs b/tools/MUMmer3.23/combineMUMs deleted file mode 100755 index c5fc016..0000000 Binary files a/tools/MUMmer3.23/combineMUMs and /dev/null differ diff --git a/tools/MUMmer3.23/delta-filter b/tools/MUMmer3.23/delta-filter deleted file mode 100755 index d517bd3..0000000 Binary files a/tools/MUMmer3.23/delta-filter and /dev/null differ diff --git a/tools/MUMmer3.23/dnadiff b/tools/MUMmer3.23/dnadiff deleted file mode 100755 index f7ad317..0000000 --- a/tools/MUMmer3.23/dnadiff +++ /dev/null @@ -1,839 +0,0 @@ -#!/usr/bin/perl -w - -#------------------------------------------------------------------------------- -# Programmer: Adam M Phillippy, University of 
Maryland -# File: dnadiff -# Date: 11 / 29 / 06 -# -# Try 'dnadiff -h' for more information. -# -#------------------------------------------------------------------------------- - -use lib "/export/home/zqhu/tools/MUMmer3.23/scripts"; -use Foundation; -use File::Spec::Functions; -use strict; - -my $BIN_DIR = "/export/home/zqhu/tools/MUMmer3.23"; -my $SCRIPT_DIR = "/export/home/zqhu/tools/MUMmer3.23/scripts"; - -my $VERSION_INFO = q~ -DNAdiff version 1.3 - ~; - -my $HELP_INFO = q~ - USAGE: dnadiff [options] - or dnadiff [options] -d - - DESCRIPTION: - Run comparative analysis of two sequence sets using nucmer and its - associated utilities with recommended parameters. See MUMmer - documentation for a more detailed description of the - output. Produces the following output files: - - .report - Summary of alignments, differences and SNPs - .delta - Standard nucmer alignment output - .1delta - 1-to-1 alignment from delta-filter -1 - .mdelta - M-to-M alignment from delta-filter -m - .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta - .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta - .snps - SNPs from show-snps -rlTHC .1delta - .rdiff - Classified ref breakpoints from show-diff -rH .mdelta - .qdiff - Classified qry breakpoints from show-diff -qH .mdelta - .unref - Unaligned reference IDs and lengths (if applicable) - .unqry - Unaligned query IDs and lengths (if applicable) - - MANDATORY: - reference Set the input reference multi-FASTA filename - query Set the input query multi-FASTA filename - or - delta file Unfiltered .delta alignment file from nucmer - - OPTIONS: - -d|delta Provide precomputed delta file for analysis - -h - --help Display help information and exit - -p|prefix Set the prefix of the output files (default "out") - -V - --version Display the version information and exit - ~; - - -my $USAGE_INFO = q~ - USAGE: dnadiff [options] - or dnadiff [options] -d - ~; - - -my @DEPEND_INFO = - ( - "$BIN_DIR/delta-filter", - 
"$BIN_DIR/show-diff", - "$BIN_DIR/show-snps", - "$BIN_DIR/show-coords", - "$BIN_DIR/nucmer", - "$SCRIPT_DIR/Foundation.pm" - ); - -my $DELTA_FILTER = "$BIN_DIR/delta-filter"; -my $SHOW_DIFF = "$BIN_DIR/show-diff"; -my $SHOW_SNPS = "$BIN_DIR/show-snps"; -my $SHOW_COORDS = "$BIN_DIR/show-coords"; -my $NUCMER = "$BIN_DIR/nucmer"; - -my $SNPBuff = 20; # required buffer around "good" snps -my $OPT_Prefix = "out"; # prefix for all output files -my $OPT_RefFile; # reference file -my $OPT_QryFile; # query file -my $OPT_DeltaFile; # unfiltered alignment file -my $OPT_ReportFile = ".report"; # report file -my $OPT_DeltaFile1 = ".1delta"; # 1-to-1 delta alignment -my $OPT_DeltaFileM = ".mdelta"; # M-to-M delta alignment -my $OPT_CoordsFile1 = ".1coords"; # 1-to-1 alignment coords -my $OPT_CoordsFileM = ".mcoords"; # M-to-M alignment coords -my $OPT_SnpsFile = ".snps"; # snps output file -my $OPT_DiffRFile = ".rdiff"; # diffile for R -my $OPT_DiffQFile = ".qdiff"; # diffile for Q -my $OPT_UnRefFile = ".unref"; # unaligned ref IDs and lengths -my $OPT_UnQryFile = ".unqry"; # unaligned qry IDs and lengths - -my $TIGR; # TIGR Foundation object - - -sub RunAlignment(); -sub RunFilter(); -sub RunCoords(); -sub RunSNPs(); -sub RunDiff(); -sub MakeReport(); - -sub FastaSizes($$); - -sub FileOpen($$); -sub FileClose($$); - -sub GetOpt(); - - -#--------------------------------------------------------------------- main ---- - main: -{ - GetOpt(); - - RunAlignment() unless defined($OPT_DeltaFile); - RunFilter(); - RunCoords(); - RunSNPs(); - RunDiff(); - MakeReport(); - - exit(0); -} - - -#------------------------------------------------------------- RunAlignment ---- -# Run nucmer -sub RunAlignment() -{ - print STDERR "Building alignments\n"; - my $cmd = "$NUCMER --maxmatch -p $OPT_Prefix $OPT_RefFile $OPT_QryFile"; - my $err = "ERROR: Failed to run nucmer, aborting.\n"; - - system($cmd) == 0 or die $err; - $OPT_DeltaFile = $OPT_Prefix . 
".delta"; -} - - -#---------------------------------------------------------------- RunFilter ---- -# Run delta-filter -sub RunFilter() -{ - print STDERR "Filtering alignments\n"; - my $cmd1 = "$DELTA_FILTER -1 $OPT_DeltaFile > $OPT_DeltaFile1"; - my $cmd2 = "$DELTA_FILTER -m $OPT_DeltaFile > $OPT_DeltaFileM"; - my $err = "ERROR: Failed to run delta-filter, aborting.\n"; - - system($cmd1) == 0 or die $err; - system($cmd2) == 0 or die $err; -} - - -#------------------------------------------------------------------ RunSNPs ---- -# Run show-snps -sub RunSNPs() -{ - print STDERR "Analyzing SNPs\n"; - my $cmd = "$SHOW_SNPS -rlTHC $OPT_DeltaFile1 > $OPT_SnpsFile"; - my $err = "ERROR: Failed to run show-snps, aborting.\n"; - - system($cmd) == 0 or die $err; -} - - -#---------------------------------------------------------------- RunCoords ---- -# Run show-coords -sub RunCoords() -{ - print STDERR "Extracting alignment coordinates\n"; - my $cmd1 = "$SHOW_COORDS -rclTH $OPT_DeltaFile1 > $OPT_CoordsFile1"; - my $cmd2 = "$SHOW_COORDS -rclTH $OPT_DeltaFileM > $OPT_CoordsFileM"; - my $err = "ERROR: Failed to run show-coords, aborting.\n"; - - system($cmd1) == 0 or die $err; - system($cmd2) == 0 or die $err; -} - - -#------------------------------------------------------------------ RunDiff ---- -# Run show-diff -sub RunDiff() -{ - print STDERR "Extracting alignment breakpoints\n"; - my $cmd1 = "$SHOW_DIFF -rH $OPT_DeltaFileM > $OPT_DiffRFile"; - my $cmd2 = "$SHOW_DIFF -qH $OPT_DeltaFileM > $OPT_DiffQFile"; - my $err = "ERROR: Failed to run show-diff, aborting.\n"; - - system($cmd1) == 0 or die $err; - system($cmd2) == 0 or die $err; -} - - -#--------------------------------------------------------------- MakeReport ---- -# Output alignment report -sub MakeReport() -{ - print STDERR "Generating report file\n"; - - my ($fhi, $fho); # filehandle-in and filehandle-out - my (%refs, %qrys) = ((),()); # R and Q ID->length - my ($rqnAligns1, $rqnAlignsM) = (0,0); # alignment counter 
- my ($rSumLen1, $qSumLen1) = (0,0); # alignment length sum - my ($rSumLenM, $qSumLenM) = (0,0); # alignment length sum - my ($rqSumLen1, $rqSumLenM) = (0,0); # combined alignment length sum - my ($rqSumIdy1, $rqSumIdyM) = (0,0); # weighted alignment identity sum - my ($qnIns, $rnIns) = (0,0); # insertion count - my ($qSumIns, $rSumIns) = (0,0); # insertion length sum - my ($qnTIns, $rnTIns) = (0,0); # tandem insertion count - my ($qSumTIns, $rSumTIns) = (0,0); # tandem insertion length sum - my ($qnInv, $rnInv) = (0,0); # inversion count - my ($qnRel, $rnRel) = (0,0); # relocation count - my ($qnTrn, $rnTrn) = (0,0); # translocation count - my ($rnSeqs, $qnSeqs) = (0,0); # sequence count - my ($rnASeqs, $qnASeqs) = (0,0); # aligned sequence count - my ($rnBases, $qnBases) = (0,0); # bases count - my ($rnABases, $qnABases) = (0,0); # aligned bases count - my ($rnBrk, $qnBrk) = (0,0); # breakpoint count - my ($rqnSNPs, $rqnIndels) = (0,0); # snp and indel counts - my ($rqnGSNPs, $rqnGIndels) = (0,0); # good snp and indel counts - my %rqSNPs = # SNP hash - ( "."=>{"A"=>0,"C"=>0,"G"=>0,"T"=>0}, - "A"=>{"."=>0,"C"=>0,"G"=>0,"T"=>0}, - "C"=>{"."=>0,"A"=>0,"G"=>0,"T"=>0}, - "G"=>{"."=>0,"A"=>0,"C"=>0,"T"=>0}, - "T"=>{"."=>0,"A"=>0,"C"=>0,"G"=>0} ); - my %rqGSNPs = # good SNP hash - ( "."=>{"A"=>0,"C"=>0,"G"=>0,"T"=>0}, - "A"=>{"."=>0,"C"=>0,"G"=>0,"T"=>0}, - "C"=>{"."=>0,"A"=>0,"G"=>0,"T"=>0}, - "G"=>{"."=>0,"A"=>0,"C"=>0,"T"=>0}, - "T"=>{"."=>0,"A"=>0,"C"=>0,"G"=>0} ); - - my $header; # delta header - - #-- Get delta header - $fhi = FileOpen("<", $OPT_DeltaFile); - $header .= <$fhi>; - $header .= <$fhi>; - $header .= "\n"; - FileClose($fhi, $OPT_DeltaFile); - - #-- Collect all reference and query IDs and lengths - FastaSizes($OPT_RefFile, \%refs); - FastaSizes($OPT_QryFile, \%qrys); - - #-- Count ref and qry seqs and lengths - foreach ( values(%refs) ) { - $rnSeqs++; - $rnBases += $_; - } - foreach ( values(%qrys) ) { - $qnSeqs++; - $qnBases += $_; - } - - #-- Count 
aligned seqs, aligned bases, and breakpoints for each R and Q - $fhi = FileOpen("<", $OPT_CoordsFileM); - while (<$fhi>) { - chomp; - my @A = split "\t"; - scalar(@A) == 13 - or die "ERROR: Unrecognized format $OPT_CoordsFileM, aborting.\n"; - - #-- Add to M-to-M alignment counts - $rqnAlignsM++; - $rSumLenM += $A[4]; - $qSumLenM += $A[5]; - $rqSumIdyM += ($A[6] / 100.0) * ($A[4] + $A[5]); - $rqSumLenM += ($A[4] + $A[5]); - - #-- If new ID, add to sequence and base count - if ( $refs{$A[11]} > 0 ) { - $rnASeqs++; - $rnABases += $refs{$A[11]}; - $refs{$A[11]} *= -1; # If ref has alignment, length will be -neg - } - if ( $qrys{$A[12]} > 0 ) { - $qnASeqs++; - $qnABases += $qrys{$A[12]}; - $qrys{$A[12]} *= -1; # If qry has alignment, length will be -neg - } - - #-- Add to breakpoint counts - my ($lo, $hi); - if ( $A[0] < $A[1] ) { $lo = $A[0]; $hi = $A[1]; } - else { $lo = $A[1]; $hi = $A[0]; } - $rnBrk++ if ( $lo != 1 ); - $rnBrk++ if ( $hi != $A[7] ); - - if ( $A[2] < $A[3] ) { $lo = $A[2]; $hi = $A[3]; } - else { $lo = $A[3]; $hi = $A[2]; } - $qnBrk++ if ( $lo != 1 ); - $qnBrk++ if ( $hi != $A[8] ); - } - FileClose($fhi, $OPT_CoordsFileM); - - #-- Calculate average %idy, length, etc. - $fhi = FileOpen("<", $OPT_CoordsFile1); - while (<$fhi>) { - chomp; - my @A = split "\t"; - scalar(@A) == 13 - or die "ERROR: Unrecognized format $OPT_CoordsFile1, aborting.\n"; - - #-- Add to 1-to-1 alignment counts - $rqnAligns1++; - $rSumLen1 += $A[4]; - $qSumLen1 += $A[5]; - $rqSumIdy1 += ($A[6] / 100.0) * ($A[4] + $A[5]); - $rqSumLen1 += ($A[4] + $A[5]); - } - FileClose($fhi, $OPT_CoordsFile1); - - #-- If you are reading this, you need to get out more... 
- - #-- Count reference diff features and indels - $fhi = FileOpen("<", $OPT_DiffRFile); - while (<$fhi>) { - chomp; - my @A = split "\t"; - defined($A[4]) - or die "ERROR: Unrecognized format $OPT_DiffRFile, aborting.\n"; - my $gap = $A[4]; - my $ins = $gap; - - #-- Add to tandem insertion counts - if ( $A[1] eq "GAP" ) { - scalar(@A) == 7 - or die "ERROR: Unrecognized format $OPT_DiffRFile, aborting.\n"; - $ins = $A[6] if ( $A[6] > $gap ); - if ( $A[4] <= 0 && $A[5] <= 0 && $A[6] > 0 ) { - $rnTIns++; - $rSumTIns += $A[6]; - } - } - - #-- Remove unaligned sequence from count - if ( $A[1] ne "DUP" ) { - $rnABases -= $gap if ( $gap > 0 ); - } - - #-- Add to insertion count - if ( $ins > 0 ) { - $rnIns++; - $rSumIns += $ins; - } - - #-- Add to rearrangement counts - $rnInv++ if ( $A[1] eq "INV" ); - $rnRel++ if ( $A[1] eq "JMP" ); - $rnTrn++ if ( $A[1] eq "SEQ" ); - } - FileClose($fhi, $OPT_DiffRFile); - - #-- Count query diff features and indels - $fhi = FileOpen("<", $OPT_DiffQFile); - while (<$fhi>) { - chomp; - my @A = split "\t"; - defined($A[4]) - or die "ERROR: Unrecognized format $OPT_DiffRFile, aborting.\n"; - my $gap = $A[4]; - my $ins = $gap; - - #-- Add to tandem insertion counts - if ( $A[1] eq "GAP" ) { - scalar(@A) == 7 - or die "ERROR: Unrecognized format $OPT_DiffRFile, aborting.\n"; - $ins = $A[6] if ( $A[6] > $gap ); - if ( $A[4] <= 0 && $A[5] <= 0 && $A[6] > 0 ) { - $qnTIns++; - $qSumTIns += $A[6]; - } - } - - #-- Remove unaligned sequence from count - if ( $A[1] ne "DUP" ) { - $qnABases -= $gap if ( $gap > 0 ); - } - - #-- Add to insertion count - if ( $ins > 0 ) { - $qnIns++; - $qSumIns += $ins; - } - - #-- Add to rearrangement counts - $qnInv++ if ( $A[1] eq "INV" ); - $qnRel++ if ( $A[1] eq "JMP" ); - $qnTrn++ if ( $A[1] eq "SEQ" ); - } - FileClose($fhi, $OPT_DiffQFile); - - #-- Count SNPs - $fhi = FileOpen("<", $OPT_SnpsFile); - while(<$fhi>) { - chomp; - my @A = split "\t"; - scalar(@A) == 12 - or die "ERROR: Unrecognized format 
$OPT_SnpsFile, aborting\n"; - - my $r = uc($A[1]); - my $q = uc($A[2]); - - #-- Plain SNPs - $rqSNPs{$r}{$q}++; - if ( !exists($rqSNPs{$q}{$r}) ) { $rqSNPs{$q}{$r} = 0; } - if ( $r eq '.' || $q eq '.' ) { $rqnIndels++; } - else { $rqnSNPs++; } - - #-- Good SNPs with sufficient match buffer - if ( $A[4] >= $SNPBuff ) { - $rqGSNPs{$r}{$q}++; - if ( !exists($rqGSNPs{$q}{$r}) ) { $rqGSNPs{$q}{$r} = 0; } - if ( $r eq '.' || $q eq '.' ) { $rqnGIndels++; } - else { $rqnGSNPs++; } - } - } - FileClose($fhi, $OPT_SnpsFile); - - - #-- Output report - $fho = FileOpen(">", $OPT_ReportFile); - - print $fho $header; - printf $fho "%-15s %20s %20s\n", "", "[REF]", "[QRY]"; - - print $fho "[Sequences]\n"; - - printf $fho "%-15s %20d %20d\n", - "TotalSeqs", $rnSeqs, $qnSeqs; - printf $fho "%-15s %20s %20s\n", - "AlignedSeqs", - ( sprintf "%10d(%.2f%%)", - $rnASeqs, ($rnSeqs ? $rnASeqs / $rnSeqs * 100.0 : 0) ), - ( sprintf "%10d(%.2f%%)", - $qnASeqs, ($rnSeqs ? $qnASeqs / $qnSeqs * 100.0 : 0) ); - printf $fho "%-15s %20s %20s\n", - "UnalignedSeqs", - ( sprintf "%10d(%.2f%%)", - $rnSeqs - $rnASeqs, - ($rnSeqs ? ($rnSeqs - $rnASeqs) / $rnSeqs * 100.0 : 0) ), - ( sprintf "%10d(%.2f%%)", - $qnSeqs - $qnASeqs, - ($qnSeqs ? ($qnSeqs - $qnASeqs) / $qnSeqs * 100.0 : 0) ); - - print $fho "\n[Bases]\n"; - - printf $fho "%-15s %20d %20d\n", - "TotalBases", $rnBases, $qnBases; - printf $fho "%-15s %20s %20s\n", - "AlignedBases", - ( sprintf "%10d(%.2f%%)", - $rnABases, ($rnBases ? $rnABases / $rnBases * 100.0 : 0) ), - ( sprintf "%10d(%.2f%%)", - $qnABases, ($qnBases ? $qnABases / $qnBases * 100.0 : 0) ); - printf $fho "%-15s %20s %20s\n", - "UnalignedBases", - ( sprintf "%10d(%.2f%%)", - $rnBases - $rnABases, - ($rnBases ? ($rnBases - $rnABases) / $rnBases * 100.0 : 0) ), - ( sprintf "%10d(%.2f%%)", - $qnBases - $qnABases, - ($qnBases ? 
($qnBases - $qnABases) / $qnBases * 100.0 : 0) ); - - print $fho "\n[Alignments]\n"; - - printf $fho "%-15s %20d %20d\n", - "1-to-1", $rqnAligns1, $rqnAligns1; - printf $fho "%-15s %20d %20d\n", - "TotalLength", $rSumLen1, $qSumLen1; - printf $fho "%-15s %20.2f %20.2f\n", - "AvgLength", - ($rqnAligns1 ? $rSumLen1 / $rqnAligns1 : 0), - ($rqnAligns1 ? $qSumLen1 / $rqnAligns1 : 0); - printf $fho "%-15s %20.2f %20.2f\n", - "AvgIdentity", - ($rqSumLen1 ? $rqSumIdy1 / $rqSumLen1 * 100.0 : 0), - ($rqSumLen1 ? $rqSumIdy1 / $rqSumLen1 * 100.0 : 0); - - print $fho "\n"; - - printf $fho "%-15s %20d %20d\n", - "M-to-M", $rqnAlignsM, $rqnAlignsM; - printf $fho "%-15s %20d %20d\n", - "TotalLength", $rSumLenM, $qSumLenM; - printf $fho "%-15s %20.2f %20.2f\n", - "AvgLength", - ($rqnAlignsM ? $rSumLenM / $rqnAlignsM : 0), - ($rqnAlignsM ? $qSumLenM / $rqnAlignsM : 0); - printf $fho "%-15s %20.2f %20.2f\n", - "AvgIdentity", - ($rqSumLenM ? $rqSumIdyM / $rqSumLenM * 100.0 : 0), - ($rqSumLenM ? $rqSumIdyM / $rqSumLenM * 100.0 : 0); - - print $fho "\n[Feature Estimates]\n"; - - printf $fho "%-15s %20d %20d\n", - "Breakpoints", $rnBrk, $qnBrk; - printf $fho "%-15s %20d %20d\n", - "Relocations", $rnRel, $qnRel; - printf $fho "%-15s %20d %20d\n", - "Translocations", $rnTrn, $qnTrn; - printf $fho "%-15s %20d %20d\n", - "Inversions", $rnInv, $qnInv; - - print $fho "\n"; - - printf $fho "%-15s %20d %20d\n", - "Insertions", $rnIns, $qnIns; - printf $fho "%-15s %20d %20d\n", - "InsertionSum", $rSumIns, $qSumIns; - printf $fho "%-15s %20.2f %20.2f\n", - "InsertionAvg", - ($rnIns ? $rSumIns / $rnIns : 0), - ($qnIns ? $qSumIns / $qnIns : 0); - - print $fho "\n"; - - printf $fho "%-15s %20d %20d\n", - "TandemIns", $rnTIns, $qnTIns; - printf $fho "%-15s %20d %20d\n", - "TandemInsSum", $rSumTIns, $qSumTIns; - printf $fho "%-15s %20.2f %20.2f\n", - "TandemInsAvg", - ($rnTIns ? $rSumTIns / $rnTIns : 0), - ($qnTIns ? 
$qSumTIns / $qnTIns : 0); - - print $fho "\n[SNPs]\n"; - - printf $fho "%-15s %20d %20d\n", - "TotalSNPs", $rqnSNPs, $rqnSNPs; - foreach my $r (keys %rqSNPs) { - foreach my $q (keys %{$rqSNPs{$r}}) { - if ( $r ne "." && $q ne "." ) { - printf $fho "%-15s %20s %20s\n", - "$r$q", - ( sprintf "%10d(%.2f%%)", - $rqSNPs{$r}{$q}, - ($rqnSNPs ? $rqSNPs{$r}{$q} / $rqnSNPs * 100.0 : 0) ), - ( sprintf "%10d(%.2f%%)", - $rqSNPs{$q}{$r}, - ($rqnSNPs ? $rqSNPs{$q}{$r} / $rqnSNPs * 100.0 : 0) ); - } - } - } - - print $fho "\n"; - - printf $fho "%-15s %20d %20d\n", - "TotalGSNPs", $rqnGSNPs, $rqnGSNPs; - foreach my $r (keys %rqGSNPs) { - foreach my $q (keys %{$rqGSNPs{$r}}) { - if ( $r ne "." && $q ne "." ) { - printf $fho "%-15s %20s %20s\n", - "$r$q", - ( sprintf "%10d(%.2f%%)", - $rqGSNPs{$r}{$q}, - ($rqnGSNPs ? $rqGSNPs{$r}{$q} / $rqnGSNPs * 100.0 : 0) ), - ( sprintf "%10d(%.2f%%)", - $rqGSNPs{$q}{$r}, - ($rqnGSNPs ? $rqGSNPs{$q}{$r} / $rqnGSNPs * 100.0 : 0) ); - } - } - } - - print $fho "\n"; - - printf $fho "%-15s %20d %20d\n", - "TotalIndels", $rqnIndels, $rqnIndels; - foreach my $r (keys %rqSNPs) { - foreach my $q (keys %{$rqSNPs{$r}}) { - if ( $q eq "." ) { - printf $fho "%-15s %20s %20s\n", - "$r$q", - ( sprintf "%10d(%.2f%%)", - $rqSNPs{$r}{$q}, - ($rqnIndels ? $rqSNPs{$r}{$q} / $rqnIndels * 100.0 : 0) ), - ( sprintf "%10d(%.2f%%)", - $rqSNPs{$q}{$r}, - ($rqnIndels ? $rqSNPs{$q}{$r} / $rqnIndels * 100.0 : 0) ); - } - } - } - foreach my $r (keys %rqSNPs) { - foreach my $q (keys %{$rqSNPs{$r}}) { - if ( $r eq "." ) { - printf $fho "%-15s %20s %20s\n", - "$r$q", - ( sprintf "%10d(%.2f%%)", - $rqSNPs{$r}{$q}, - ($rqnIndels ? $rqSNPs{$r}{$q} / $rqnIndels * 100.0 : 0) ), - ( sprintf "%10d(%.2f%%)", - $rqSNPs{$q}{$r}, - ($rqnIndels ? 
$rqSNPs{$q}{$r} / $rqnIndels * 100.0 : 0) ); - } - } - } - - print $fho "\n"; - - printf $fho "%-15s %20d %20d\n", - "TotalGIndels", $rqnGIndels, $rqnGIndels; - foreach my $r (keys %rqGSNPs) { - foreach my $q (keys %{$rqGSNPs{$r}}) { - if ( $q eq "." ) { - printf $fho "%-15s %20s %20s\n", - "$r$q", - ( sprintf "%10d(%.2f%%)", - $rqGSNPs{$r}{$q}, - ($rqnGIndels ? $rqGSNPs{$r}{$q} / $rqnGIndels * 100.0 : 0) ), - ( sprintf "%10d(%.2f%%)", - $rqGSNPs{$q}{$r}, - ($rqnGIndels ? $rqGSNPs{$q}{$r} / $rqnGIndels * 100.0 : 0) ); - } - } - } - foreach my $r (keys %rqGSNPs) { - foreach my $q (keys %{$rqGSNPs{$r}}) { - if ( $r eq "." ) { - printf $fho "%-15s %20s %20s\n", - "$r$q", - ( sprintf "%10d(%.2f%%)", - $rqGSNPs{$r}{$q}, - ($rqnGIndels ? $rqGSNPs{$r}{$q} / $rqnGIndels * 100.0 : 0) ), - ( sprintf "%10d(%.2f%%)", - $rqGSNPs{$q}{$r}, - ($rqnGIndels ? $rqGSNPs{$q}{$r} / $rqnGIndels * 100.0 : 0) ); - } - } - } - - FileClose($fho, $OPT_ReportFile); - - - #-- Output unaligned reference and query IDs, if applicable - if ( $rnSeqs != $rnASeqs ) { - $fho = FileOpen(">", $OPT_UnRefFile); - while ( my ($key, $val) = each(%refs) ) { - print $fho "$key\tUNI\t1\t$val\t$val\n" unless $val < 0; - } - FileClose($fho, $OPT_UnRefFile); - } - if ( $qnSeqs != $qnASeqs ) { - $fho = FileOpen(">", $OPT_UnQryFile); - while ( my ($key, $val) = each(%qrys) ) { - print $fho "$key\tUNI\t1\t$val\t$val\n" unless $val < 0; - } - FileClose($fho, $OPT_UnQryFile); - } -} - - -#--------------------------------------------------------------- FastaSizes ---- -# Compute lengths for a multi-fasta file and store in hash reference -sub FastaSizes($$) -{ - - my $file = shift; - my $href = shift; - my ($tag, $len); - - my $fhi = FileOpen("<", $file); - while (<$fhi>) { - chomp; - - if ( /^>/ ) { - $href->{$tag} = $len if defined($tag); - ($tag) = /^>(\S+)/; - $len = 0; - } else { - if ( /\s/ ) { - die "ERROR: Whitespace found in FastA $file, aborting.\n"; - } - $len += length; - } - } - $href->{$tag} = $len if 
defined($tag); - FileClose($fhi, $file); -} - - -#----------------------------------------------------------------- FileOpen ---- -# Open file, return filehandle, or die -sub FileOpen($$) -{ - my ($mode, $name) = @_; - my $fhi; - open($fhi, $mode, $name) - or die "ERROR: Could not open $name, aborting. $!\n"; - return $fhi; -} - - -#---------------------------------------------------------------- FileClose ---- -# Close file, or die -sub FileClose($$) -{ - my ($fho, $name) = @_; - close($fho) or die "ERROR: Could not close $name, aborting. $!\n" -} - - -#------------------------------------------------------------------- GetOpt ---- -# Get command options and check file permissions -sub GetOpt() -{ - #-- Initialize TIGR::Foundation - $TIGR = new TIGR::Foundation; - if ( !defined($TIGR) ) { - print STDERR "ERROR: TIGR::Foundation could not be initialized"; - exit(1); - } - - #-- Set help and usage information - $TIGR->setHelpInfo($HELP_INFO); - $TIGR->setUsageInfo($USAGE_INFO); - $TIGR->setVersionInfo($VERSION_INFO); - $TIGR->addDependInfo(@DEPEND_INFO); - - #-- Get options - my $err = !$TIGR->TIGR_GetOptions - ( - "d|delta=s" => \$OPT_DeltaFile, - "p|prefix=s" => \$OPT_Prefix, - ); - - #-- Check if the parsing was successful - if ( $err - || (defined($OPT_DeltaFile) && scalar(@ARGV) != 0) - || (!defined($OPT_DeltaFile) && scalar(@ARGV) != 2) ) { - $TIGR->printUsageInfo(); - print STDERR "Try '$0 -h' for more information.\n"; - exit(1); - } - - my @errs; - - $TIGR->isExecutableFile($DELTA_FILTER) - or push(@errs, $DELTA_FILTER); - - $TIGR->isExecutableFile($SHOW_DIFF) - or push(@errs, $SHOW_DIFF); - - $TIGR->isExecutableFile($SHOW_SNPS) - or push(@errs, $SHOW_SNPS); - - $TIGR->isExecutableFile($SHOW_COORDS) - or push(@errs, $SHOW_COORDS); - - $TIGR->isExecutableFile($NUCMER) - or push(@errs, $NUCMER); - - if ( defined($OPT_DeltaFile) ) { - $TIGR->isReadableFile($OPT_DeltaFile) - or push(@errs, $OPT_DeltaFile); - - my $fhi = FileOpen("<", $OPT_DeltaFile); - $_ = 
<$fhi>; - FileClose($fhi, $OPT_DeltaFile); - - ($OPT_RefFile, $OPT_QryFile) = /^(.+) (.+)$/; - } - else { - $OPT_RefFile = File::Spec->rel2abs($ARGV[0]); - $OPT_QryFile = File::Spec->rel2abs($ARGV[1]); - } - - $TIGR->isReadableFile($OPT_RefFile) - or push(@errs, $OPT_RefFile); - - $TIGR->isReadableFile($OPT_QryFile) - or push(@errs, $OPT_QryFile); - - $OPT_ReportFile = $OPT_Prefix . $OPT_ReportFile; - $TIGR->isCreatableFile("$OPT_ReportFile") - or $TIGR->isWritableFile("$OPT_ReportFile") - or push(@errs, "$OPT_ReportFile"); - - $OPT_DeltaFile1 = $OPT_Prefix . $OPT_DeltaFile1; - $TIGR->isCreatableFile("$OPT_DeltaFile1") - or $TIGR->isWritableFile("$OPT_DeltaFile1") - or push(@errs, "$OPT_DeltaFile1"); - - $OPT_DeltaFileM = $OPT_Prefix . $OPT_DeltaFileM; - $TIGR->isCreatableFile("$OPT_DeltaFileM") - or $TIGR->isWritableFile("$OPT_DeltaFileM") - or push(@errs, "$OPT_DeltaFileM"); - - $OPT_CoordsFile1 = $OPT_Prefix . $OPT_CoordsFile1; - $TIGR->isCreatableFile("$OPT_CoordsFile1") - or $TIGR->isWritableFile("$OPT_CoordsFile1") - or push(@errs, "$OPT_CoordsFile1"); - - $OPT_CoordsFileM = $OPT_Prefix . $OPT_CoordsFileM; - $TIGR->isCreatableFile("$OPT_CoordsFileM") - or $TIGR->isWritableFile("$OPT_CoordsFileM") - or push(@errs, "$OPT_CoordsFileM"); - - $OPT_SnpsFile = $OPT_Prefix . $OPT_SnpsFile; - $TIGR->isCreatableFile("$OPT_SnpsFile") - or $TIGR->isWritableFile("$OPT_SnpsFile") - or push(@errs, "$OPT_SnpsFile"); - - $OPT_DiffRFile = $OPT_Prefix . $OPT_DiffRFile; - $TIGR->isCreatableFile("$OPT_DiffRFile") - or $TIGR->isWritableFile("$OPT_DiffRFile") - or push(@errs, "$OPT_DiffRFile"); - - $OPT_DiffQFile = $OPT_Prefix . $OPT_DiffQFile; - $TIGR->isCreatableFile("$OPT_DiffQFile") - or $TIGR->isWritableFile("$OPT_DiffQFile") - or push(@errs, "$OPT_DiffQFile"); - - $OPT_UnRefFile = $OPT_Prefix . $OPT_UnRefFile; - $TIGR->isCreatableFile("$OPT_UnRefFile") - or $TIGR->isWritableFile("$OPT_UnRefFile") - or push(@errs, "$OPT_UnRefFile"); - - $OPT_UnQryFile = $OPT_Prefix . 
$OPT_UnQryFile; - $TIGR->isCreatableFile("$OPT_UnQryFile") - or $TIGR->isWritableFile("$OPT_UnQryFile") - or push(@errs, "$OPT_UnQryFile"); - - if ( scalar(@errs) ) { - print STDERR "ERROR: The following critical files could not be used\n"; - while ( scalar(@errs) ) { print(STDERR pop(@errs),"\n"); } - print STDERR "Check your paths and file permissions and try again\n"; - exit(1); - } -} diff --git a/tools/MUMmer3.23/exact-tandems b/tools/MUMmer3.23/exact-tandems deleted file mode 100755 index d24ec3e..0000000 --- a/tools/MUMmer3.23/exact-tandems +++ /dev/null @@ -1,23 +0,0 @@ -#!/bin/csh -f -# -# Find exact tandem repeats in specified file involving an -# exact duplicate of at least the specified length - -set filename = $1 -set matchlen = $2 - -set bindir = /export/home/zqhu/tools/MUMmer3.23 -set scriptdir = /export/home/zqhu/tools/MUMmer3.23/scripts - -if ($filename == '' || $matchlen == '') then - echo "USAGE: $0 " - exit -1 -endif - -echo "Finding matches" -$bindir/repeat-match -t -n $matchlen $filename | tail +3 > $$.tmp.matches -if ($status != 0) exit -1 - -echo "Tandem repeats" -sort -k1n -k2n $$.tmp.matches | awk -f $scriptdir/tandem-repeat.awk -rm -f $$.tmp.matches diff --git a/tools/MUMmer3.23/gaps b/tools/MUMmer3.23/gaps deleted file mode 100755 index 17eddbd..0000000 Binary files a/tools/MUMmer3.23/gaps and /dev/null differ diff --git a/tools/MUMmer3.23/mapview b/tools/MUMmer3.23/mapview deleted file mode 100755 index 85dd7db..0000000 --- a/tools/MUMmer3.23/mapview +++ /dev/null @@ -1,967 +0,0 @@ -#!/usr/bin/perl - -use lib "/export/home/zqhu/tools/MUMmer3.23/scripts"; -use Foundation; - -my $SCRIPT_DIR = "/export/home/zqhu/tools/MUMmer3.23/scripts"; - - -my $VERSION_INFO = q~ -mapview version 1.01 - ~; - - -my $HELP_INFO = q~ - USAGE: mapview [options] [UTR coords] [CDS coords] - - DESCRIPTION: - mapview is a utility program for displaying sequence alignments as - provided by MUMmer, NUCmer, PROmer or Mgaps. 
mapview takes the output of - show-coords and converts it to a FIG, PDF or PS file for visual analysis. - It can also break the output into multiple files for easier viewing and - printing. - - MANDATORY: - coords file The output of 'show-coords -rl[k]' or 'mgaps' - - OPTIONS: - UTR coords UTR coordinate file in GFF format - CDS coords CDS coordinate file in GFF format - - -d|maxdist Set the maximum base-pair distance between linked matches - (default 50000) - -f|format Set the output format to 'pdf', 'ps' or 'fig' - (default 'fig') - -h - --help Display help information and exit - -m|mag Set the magnification at which the figure is rendered, - this is an option for fig2dev which is used to generate - the PDF and PS files (default 1.0) - -n|num Set the number of output files used to partition the - output, this is to avoid generating files that are too - large to display (default 10) - -p|prefix Set the output file prefix - (default "PROMER_graph or NUCMER_graph") - -v - --verbose Verbose logging of the processed files - -V - --version Display the version information and exit - -x1 coord Set the lower coordinate bound of the display - -x2 coord Set the upper coordinate bound of the display - -g|ref If the input file is provided by 'mgaps', set the - reference sequence ID (as it appears in the first column - of the UTR/CDS coords file) - -I Display the name of query sequences - -Ir Display the name of reference genes - ~; - - -my $USAGE_INFO = q~ - USAGE: mapview [options] [UTR coords] [CDS coords] - ~; - - -my @DEPEND_INFO = - ( - "fig2dev", - "$SCRIPT_DIR/Foundation.pm" - ); - -my $err_gff = q~ - ERROR in the input files ! The reference seq ID can't be found in GFF files ! - The first column in the GFF file should be the ID of the reference seq. - The alignments file should provide the same info in the column before the last one. - - Here are some example records for the GFF file: - - gnl|FlyBase|X Dmel3 initial-exon 2155 2413 . - . 
X_CG3038.1 - gnl|FlyBase|X Dmel3 last-exon 1182 2077 . - . X_CG3038.1 - ... - The fields are : - - ~; - -my $tigr; -my $err; - -my $alignm; -my $futr; -my $fcds; - -#-- Initialize TIGR::Foundation -$tigr = new TIGR::Foundation; -if ( !defined ($tigr) ) { - print (STDERR "ERROR: TIGR::Foundation could not be initialized"); - exit (1); -} - -#-- Set help and usage information -$tigr->setHelpInfo ($HELP_INFO); -$tigr->setUsageInfo ($USAGE_INFO); -$tigr->setVersionInfo ($VERSION_INFO); -$tigr->addDependInfo (@DEPEND_INFO); - -$err = $tigr->TIGR_GetOptions - ( - "d|maxdist=i" => \$match_dist, - "f|format=s" => \$format, - "m|mag=f" => \$magn, - "n|num=i" => \$noOutfiles, - "p|prefix=s" => \$outfilename, - "x1=i" => \$x1win, - "x2=i" => \$x2win, - "v|verbose" => \$verb, - "g|ref=s" => \$Mgaps, - "I" => \$printIDconting, - "Ir" => \$printIDgenes - ); - -if ( $err == 0 || scalar(@ARGV) < 1 || scalar(@ARGV) > 3 ) { - $tigr->printUsageInfo( ); - print (STDERR "Try '$0 -h' for more information.\n"); - exit (1); -} - -($alignm,$futr,$fcds)=@ARGV; - -if ((substr($x1win,0,1) eq '-') || (substr($x2win,0,1) eq '-')){ - print "ERROR2 : coords x1,x2 should be positive integers !!\n"; - $info=1; -} -if ($x1win > $x2win) { - print "ERROR3 : wrong range coords : x1 >= x2 !!!\n"; - $info=1; -} -if ($Mgaps){ #formating the mgaps output to be similar with show-coords output - format_mgaps(); -} - -if (!$format){$format="fig";} -if (($x1win) and ($x2win)){ $startfind=0; } -else{ $startfind=1; } -$endfind=0; -if (!$noOutfiles) {$noOutfiles=10;} -if (!$match_dist){$match_dist=50000;} - -#```````init colors```````````````````````````` -$color{"2"}=27;#dark pink 5utr -$color{"3"}=2;#green ex -$color{"4"}=1;#blue 3utr -#`````````````````````` -@linkcolors=(31,14,11); -#`````````````````````````````````````````````` -open(F,$alignm); -<F>; -$prog=<F>; chomp($prog); -<F>; -$_=<F>; -@a=(m/\s+(\||\[.+?\])/g) ; -for ($ind=0;$ind<=$#a;$ind++){ - if ($a[$ind] eq "[S1]") { - $ind_s1=$ind; - } - elsif ($a[$ind] eq
"[E1]"){ - $ind_e1=$ind; - } - elsif ($a[$ind] eq "[% IDY]"){ - $ind_pidy=$ind; - } - elsif ($a[$ind] eq "[LEN R]"){ - $ind_lenchr=$ind; - } -# elsif ($a[$ind] eq "[TAGS]"){ -# $ind_tags=$ind+1; #there are two columns for this header col -# } -} -<F>;$mref=-1; -while (<F>){ - chomp; - @a=split; - if (!exists $hRefContigId{$a[-2]}) { # print $a[-2]."\n"; - $hRefContigId{$a[-2]}=$a[$ind_lenchr]; - $lenrefseqs+=$a[$ind_lenchr]; - $mref++; - } -} -$nobpinfile=int($lenrefseqs/$noOutfiles); - -close(F); - -#`````````````````````` - - -if (@ARGV > 1) { - get_cds_ends(); - get_utrcds_info(); - test_overlap(); -} -elsif(!$mref) { - $fileno=$noOutfiles; - $startcoord=0;$endcoord=0; - for ($i=0;$i<$fileno;$i++) { - $endcoord=$startcoord+$nobpinfile-1; - $endcoord=$lenrefseqs if ($endcoord>$lenrefseqs); - $file[$i]="$startcoord $endcoord"; - $startcoord=$endcoord+1; - } -} - - -$Yorig=3000; -$YdistPID=2000; -$yscale=$YdistPID/50; -$Xscale=14.5; -$gap=800; -#$maxfiles = ($fileno < 10) ? $fileno : 10; -#--------------------------------- - -if (!$mref){ - for($i=0; $i < $fileno; $i++) { - $nrf=$i; - set_output_fname(); - ($startcoord,$endcoord)=split(/\s+/,$file[$i]); - open(O,">$procfile".$nrf.".fig"); - print_header(); - print $procfile.$nrf.".fig\t range : $startcoord\t$endcoord \n" if ($verb && ($format eq "fig")); - - $xs=0; - $xe=int(($endcoord-$startcoord+1)/$Xscale); - #$xs=200; - #$xe=$xs+int(($endcoord-$startcoord+1)/$Xscale); - print_grid($xs,$xe,$startcoord,$endcoord); - - $tmpIdQrycontig=""; - $linkcolor=$linkcolors[0]; - - open(F,$alignm); - <F>;<F>;<F>;<F>;<F>; - while(<F>) { - chomp; - @a=split; - if($a[$ind_s1] > $endcoord) { last;} - if($a[$ind_s1]<$startcoord && $a[$ind_e1] > $startcoord ) { $a[$ind_s1]=$startcoord;} - if($a[$ind_s1] < $endcoord && $endcoord < $a[$ind_e1]) { $a[$ind_e1]=$endcoord;} - - if($a[$ind_s1]>=$startcoord && $a[$ind_e1]<=$endcoord) { - $x1=int(($a[$ind_s1]-$startcoord)/$Xscale);# - $x2=int(($a[$ind_e1]-$startcoord)/$Xscale);# - print_align($x1,$x2); - } - } - 
close(F); - %hQrycontig=(); - print_genes() if ($futr); - print_legend(); - close(O); - change_file_format() if ($format ne "fig"); - } -} -elsif($mref){#multiple ref seqs - set_output_fname(); - $tmpIdQrycontig=""; - $linkcolor=$linkcolors[0]; - $startdrawX=0; - $proclen=0; - $first=1; - $nrf=0; - open(F,$alignm); - <F>;<F>;<F>;<F>;<F>; - while(<F>) { - chomp; - @a=split; - if ($a[-2] ne $tmpcontig){ - %hQrycontig=(); - $tmpcontig=$a[-2]; - if ($first){ - $first=0; - $nrf++; - open(O,">$procfile".$nrf.".fig"); - print_header(); - print $procfile.$nrf.".fig"."\n" if ($verb && ($format eq "fig")); - $len=$hRefContigId{$a[-2]}; - } - else { - $startdrawX+=int($len/$Xscale)+$gap; - $len=$hRefContigId{$a[-2]}; - if (($proclen+$len>$nobpinfile) and ($proclen != 0)){ - print_legend(); - close(O); - change_file_format() if ($format ne "fig"); - $nrf++; - open(O,">$procfile".$nrf.".fig"); - print_header(); - print "\n".$procfile.$nrf.".fig"."\n" if ($verb && ($format eq "fig")); - $proclen=0; - $startdrawX=0; - } - } - $xs=$startdrawX; - $xe=$startdrawX+int($len/$Xscale); - print_grid($xs,$xe,0,$len); - #print genes from %geneinfo for contig - print $a[-2]."\t".$hRefContigId{$a[-2]}."\n"; - print_genes_mr() if ($futr); - $proclen+=$len; - }#end if new contig - $x1=$startdrawX+int($a[$ind_s1]/$Xscale); - $x2=$startdrawX+int($a[$ind_e1]/$Xscale); - print_align($x1,$x2); - } - print_legend(); - close(O); - change_file_format() if ($format ne "fig"); - - close(F); -} -#******************************************************************************* -#******************************************************************************* -sub set_output_fname{ - - if (!$outfilename) {$procfile=$prog."_graph"."_";} - else {$procfile=$outfilename."_";} - - if ($format ne "fig"){ - $procfile="tmp".$procfile; - } -} -#********************************* -sub get_cds_ends{ -#3. 
print "create \%hcds_ends...\n"; -$testGffFormat=0; - open(F,"<".$fcds);#|| die "can't open \" $fcds cds \" file !"; - while(<F>) { - chomp; - if($_) { - @a=split; - if (exists $hRefContigId{$a[0]}){#record if at least one of the ref id is the same in GFF and Align files - $testGffFormat++; - } - $genename=$a[8]; - if ($genename ne $tmpname){ - if ($sign eq "+"){ $hcds_ends{$tmpname} = "$cds5 $cds3";} - elsif ($sign eq "-"){ $hcds_ends{$tmpname} = "$cds3 $cds5";} - $tmpname=$genename; - $sign=$a[6]; - } - if($sign eq "-") { - $temp=$a[3]; - $a[3]=$a[4]; - $a[4]=$temp; - } - if ($a[2] eq "single-exon"){ - $cds5=$a[3]; - $cds3=$a[4]; - } - elsif ($a[2] eq "initial-exon"){ - $cds5=$a[3]; - } - elsif ($a[2] eq "last-exon"){ - $cds3=$a[4]; - } - } - } - if ($sign eq "+"){ $hcds_ends{$tmpname} = "$cds5 $cds3";} - elsif ($sign eq "-"){$hcds_ends{$tmpname} = "$cds3 $cds5";} - - test_formatGFF(); - - #foreach $k ( keys %hcds_ends){ - # print "cds_ends: ".$k."\t"."\n"; - #} exit; - -} - -#********************************* -sub test_formatGFF{ - if ($testGffFormat==0){ - print (STDERR "$err_gff \n"); - exit (1); - } - -} - -#********************************* -sub get_utrcds_info{ -#test for gene overlap $geneinfo{gene_name}->[0]=level,stock gene 5'3' utr ends, -#determina %geneinfo{id gene}->5utr,ex,3utr -#and @file -# get_gene_ends(); - $testGffFormat=0; -open(F,"<".$futr);# || die "can't open \" $futr utr \" file !"; -while(<F>) { - chomp; - if($_) { - @a=split; - #get gene ends-utr - $genename=$a[8]; - if (exists $hRefContigId{$a[0]}){ #check if [align file(col before the last one)] = [GFF file(col 1)] - $testGffFormat++; - } - - if ($genename ne $tmpname){ - if ($sign eq "+"){ $utr_ends{$tmpname} = "$utr5 $utr3";} - elsif ($sign eq "-"){ $utr_ends{$tmpname} = "$utr3 $utr5";} - - if ($tmpgene) {# for the distinct_utr_cds - get_utrcds_ends() ; - } - $tmpgene="$a[3] $a[4]";# - - $tmpname=$genename; - $sign=$a[6]; - $hContig_genes{$a[0]}.=" ".$a[8]; # for multiple ref. 
seqs - } - else{# for the distinct_utr_cds - if($sign eq "-") { - $tmpgene="$a[3] $a[4];".$tmpgene; - } - else { $tmpgene.=";$a[3] $a[4]"; } - }# - - if($sign eq "-") { - $temp=$a[3]; - $a[3]=$a[4]; - $a[4]=$temp; - } - if ($a[2] eq "single-exon"){ - $utr5=$a[3]; - $utr3=$a[4]; - } - elsif ($a[2] eq "initial-exon"){ - $utr5=$a[3]; - } - elsif ($a[2] eq "last-exon"){ - $utr3=$a[4]; - } - - #init gene info (level) - $geneinfo{$a[8]}->[0]=1 if (!exists $geneinfo{$a[8]}); - - } -} - # for the distinct_utr_cds - get_utrcds_ends(); - %cds_ends=();# - - if ($sign eq "+"){$utr_ends{$tmpname}="$utr5 $utr3"; } - elsif ($sign eq "-"){$utr_ends{$tmpname}="$utr3 $utr5"; } - $hContig_genes{$a[0]}.=" ".$a[8]; - - close(F); - test_formatGFF(); - -} - - -#**************************************************************** -sub get_utrcds_ends{ - $u5="";$ex="";$u3=""; - - if ($fcds eq $futr) { - $ex=$tmpgene; - } - else { - @ex=split(";",$tmpgene); - @cds=split(" ",$hcds_ends{$tmpname}); - - for($i=0;$i<=$#ex;$i++){ - @coord=split(" ",$ex[$i]); - - if ($cds[0]>$coord[0]){ - if ($cds[0]>$coord[1]){ - $u5.="$coord[0] $coord[1];"; - } - else{ - $u5.= "$coord[0] "; $u5.=$cds[0]-1 .";" ; #? - if ($cds[1]<$coord[1]){ - $ex.="$cds[0] $cds[1];"; - $u3.=$cds[1]+1 ." $coord[1];"; - } - else{ $ex.="$cds[0] $coord[1];";} - } - } - else { - if ($cds[1]>$coord[0]){ - if ($cds[1]>$coord[1]){ - $ex.="$coord[0] $coord[1];"; - } - else{ - $ex.="$coord[0] $cds[1];"; - $u3.=$cds[1]+1 ." 
$coord[1];"; - } - } - else { $u3.="$coord[0] $coord[1];";} - } - } - chop($u5, $ex, $u3); - } - $geneinfo{$tmpname}->[1]=$sign; - $geneinfo{$tmpname}->[2]=$u5; - $geneinfo{$tmpname}->[3]=$ex; - $geneinfo{$tmpname}->[4]=$u3; -} -#********************************* -sub test_overlap{ - - if (!$mref){ - $fileno=0;### - $endcoord=0;### - } - foreach $kcontgid (sort keys %hContig_genes){ - @allgenes=split(/\s+/,$hContig_genes{$kcontgid}); - for ($i=1;$i<=$#allgenes;$i++){ - @g1=split (" ", $utr_ends{$allgenes[$i]}); - $Utr5End{$allgenes[$i]}=$g1[0]; ### - - for ($j=$i+1;$j<=$#allgenes;$j++){ #comparing with the rest of the genes - @g2=split (" ", $utr_ends{$allgenes[$j]}); - #if the genes are overpaling and they have the same level ,the second gene is liflet to the next level - if ( (($g2[0]>=$g1[0]) and ($g2[0]<=$g1[1])) or (($g2[1]>=$g1[0]) and ($g2[1]<=$g1[1])) or (($g1[0]>=$g2[0]) and ($g1[0]<=$g2[1])) ){ - if ($geneinfo{$allgenes[$i]}->[0] == $geneinfo{$allgenes[$j]}->[0]){ - $geneinfo{$allgenes[$j]}->[0]=$geneinfo{$allgenes[$i]}->[0] + 1 ; - } - } - } - SetTheRangeForEachFile() if ((!$endfind) and (!$mref)); ### - } - } - $file[$fileno++]="$startcoord $endcoord" if (!$mref);### - %utr_ends=();### - -} - -#************************************** -sub SetTheRangeForEachFile{ - $currstart=$g1[0]; - $currend=$g1[1]; - #---test range ends intersection - if (!$startfind) { - if (($x1win <= $currstart) || ($x1win <= $currend)){ - $currstart = $x1win; - $startfind = 1; - } - } - if ( $startfind && $x1win && $x2win){ - if (($x2win <= $currstart) || ($x2win <= $currend) ){ - $currend = $x2win; - $endfind = 1; - } - } -#-------------------- - if ($startfind) { - if(!$endcoord) { - #$startcoord=0; - $startcoord = $x1win ? 
$x1win : 0; - $endcoord=$currend; - } - else { - if($currend > $endcoord) { - if($currend-$startcoord < $nobpinfile) { - $endcoord=$currend; - } - else { - $file[$fileno++]="$startcoord $endcoord"; - $startcoord=$endcoord+1; - $endcoord=$currend; - } - } - } - }#if startfind -} -#********************************* -sub print_header{ - print O "#FIG 3.2\nLandscape\nCenter\nInches\nLetter \n100.00\nMultiple\n-2\n1200 2\n"; -} -#********************************* -sub print_align{ - my ($x1,$x2)=@_; - - $a[$ind_pidy]=50 if ($a[$ind_pidy]<50); - $a[$ind_pidy]=int($a[$ind_pidy]); - if ($Mgaps){ - $y=$Yorig+250+$YdistPID-$yscale*2; - if($a[$#a]=~/rev$/){$y-=25*$yscale;} - } - else{ - $y=$Yorig+250+$YdistPID-$yscale*($a[$ind_pidy]-50); - } - if($x1==$x2) { $x2++;} - #draw the line between matches. is dif color for each contig - if ($a[$#a] eq $tmpIdQrycontig) { - print_connections($hQrycontig{$tmpIdQrycontig}->[1], $x1,$y); - } - else{#new contig - #remember the start coord for printing the id alignments - if ($printIDconting){ - if ( $x1 - $XlastPrint > 400 ) { - print O "4 0 0 5 0 0 8 0.0000 4 90 270 "; - printf O ("\t%.0f %.0f ",$x1,$y); - print O $a[$#a], "\\001\n"; - $XlastPrint=$x1; $YlastPrint=$y; - } - } - # - ##if it was seen before,but interrupted by another contig - if ((exists $hQrycontig{$a[$#a]}) and ($a[$ind_s1]-$hQrycontig{$a[$#a]}->[2] < $match_dist )) { - $linkcolor=$hQrycontig{$a[$#a]}->[0]; - print_connections($hQrycontig{$a[$#a]}->[1], $x1,$y); - } - else{ - #change the link color - unshift(@linkcolors, pop(@linkcolors)); - $linkcolor=$linkcolors[0]; - $hQrycontig{$a[$#a]}->[0] = $linkcolor; - } - } - $tmpIdQrycontig=$a[$#a]; - $hQrycontig{$tmpIdQrycontig}->[1]="$x2 $y"; - $hQrycontig{$tmpIdQrycontig}->[2]=$a[$ind_e1]; - - #the matches line is red - print O "2 1 0 2 4 0 40 0 -1 0.000 0 0 -1 0 0 2\n"; - print O "\t$x1 $y $x2 $y\n"; - print O "2 1 0 5 20 0 50 0 -1 0.000 0 0 -1 0 0 2\n"; - printf O ("\t $x1 %.0f $x2 %.0f\n",$Yorig+150 , $Yorig+150); -} 
-#********************************* -sub print_connections{ - my ($setc1,$setx2,$sety2)=@_; # print "\nparam connect @_\n"; - my ($setx1,$sety1) =split(/ /,$setc1); - - if ($Mgaps){ - if ($setx1>$setx2){ - $tmpsetx1=$setx1; - $setx1=$setx2; - $setx2=$tmpsetx1; - } - $distx1x2=int(($setx2-$setx1)/2); - $xcenter= $setx1+$distx1x2; - - if ($setx2-$setx1>4000) { #if the distance is to big then heigh of the arc is set to 20 - $heightArcUp = 20*$yscale; - $yoffcenter=int((($distx1x2**2)+$heightArcUp**2)*(1/(2*$heightArcUp)))-$heightArcUp ; - } - else{ - $heightArcUp = int (0.447 * $distx1x2);#sectorul de cerc la 1/3 din raza. - $yoffcenter=2*$heightArcUp; - } - print O "5 1 0 2 $linkcolor 0 50 0 -1 0.000 0 0 0 0 "; - printf O ("%.3f %.3f $setx1 $sety1 $xcenter %.0f $setx2 $sety1 \n",$xcenter,$sety1+$yoffcenter,$sety1-$heightArcUp); - } - else{ - print O "2 1 0 1 $linkcolor 0 50 0 -1 0.000 0 0 -1 0 0 2\n"; - print O "\t".$setc1." $setx2 $sety2\n"; - - } -} -#********************************* -sub print_genes_mr{ - %hLastOnLevel=(); - @g=split(/\s+/,$hContig_genes{$a[-2]}); - for ($i=1;$i<=$#g;$i++){ - $kname=$g[$i]; - $tmpx2=0; - $y=$Yorig-100-200*$geneinfo{$kname}->[0]; - #print id gena - $xid =$startdrawX+ int($Utr5End{$kname}/$Xscale); - print_Id_genes() if ($printIDgenes); - # - for ($l=2;$l<5;$l++){ - @c=split(";",$geneinfo{$kname}->[$l] ) ; - if (@c){ #print "de unde?@c\n" if (($l==2) or ($l==4)); - $colorend=$color{$l}; - if ($geneinfo{$kname}->[1] eq "-"){ - if ($l==2) { $colorend=$color{"4"}; } - elsif ($l==4) {$colorend=$color{"2"};} - } - for ($k=0;$k<=$#c;$k++){ - @e=split (" ",$c[$k]); - $x1=$startdrawX+int($e[0]/$Xscale); - $x2=$startdrawX+int($e[1]/$Xscale); - if($x1==$x2) { $x2++;} - if ( ($tmpx2) and ($x1-$tmpx2>1)){ #print the intron - print O "2 1 0 1 0 0 50 0 -1 0.000 0 0 -1 0 0 2\n"; - print O "\t $tmpx2 $y $x1 $y\n"; - } - $tmpx2=$x2; - print O "2 1 0 5 $colorend 0 50 0 -1 0.000 0 0 -1 0 0 2\n";# - print O "\t $x1 $y $x2 $y\n"; - } - } - } - #delete 
($geneinfo{$kname}); - } -} - -#********************************* -sub print_Id_genes{ - if (exists $hLastOnLevel{$geneinfo{$kname}->[0]}){ - $lastOnlevel=$hLastOnLevel{$geneinfo{$kname}->[0]}; - $printidspace = int(($Utr5End{$kname}-$Utr5End{$lastOnlevel})/$Xscale); - } - else{$printidspace=601;} - if ($printidspace > 600){ - #print contig name# - print O "4 0 0 5 0 0 6 0.0000 4 90 270 "; - printf O ("\t%.0f %.0f ",$xid,$y-50); - print O $kname, "\\001\n"; - $hLastOnLevel{$geneinfo{$kname}->[0]}=$kname; - } -} -#********************************* -sub print_genes{ - %hLastOnLevel=(); - foreach $kname (sort {$Utr5End{$a} <=> $Utr5End{$b}} keys %Utr5End){ - $tmpx2=0; - if ($Utr5End{$kname}>$startcoord && $Utr5End{$kname}<$endcoord){ - $y=$Yorig-100-200*$geneinfo{$kname}->[0]; - #print id gena - $xid = int(($Utr5End{$kname}-$startcoord)/$Xscale); - print_Id_genes() if ($printIDgenes); - # - for ($l=2;$l<5;$l++){ - @c=split(";",$geneinfo{$kname}->[$l] ); - $colorend=$color{$l}; - if ($geneinfo{$kname}->[1] eq "-"){ - if ($l==2) { $colorend=$color{"4"}; } - elsif ($l==4) {$colorend=$color{"2"};} - } - - for ($k=0;$k<=$#c;$k++){ - @e=split (" ",$c[$k]); - $x1=int(($e[0]-$startcoord)/$Xscale); - $x2=int(($e[1]-$startcoord)/$Xscale); - if($x1==$x2) { $x2++;} - if ( ($tmpx2) and ($x1-$tmpx2>1)){ #print the intron - print O "2 1 0 1 0 0 50 0 -1 0.000 0 0 -1 0 0 2\n"; - print O "\t $tmpx2 $y $x1 $y\n"; - } - $tmpx2=$x2; - print O "2 1 0 5 $colorend 0 50 0 -1 0.000 0 0 -1 0 0 2\n";# - print O "\t $x1 $y $x2 $y\n"; - } - } - }#endif "is in interval" - } - # delete ($Utr5End{$kname}); - # delete ($geneinfo{$kname}); - -} -#********************************* -sub print_grid{ - -my ($xs,$xe,$startcontg,$endcontg)=@_; - -$XlastPrint=0;$YlastPrint=0; - - #print ref contig - print O "2 1 0 10 11 0 50 0 -1 0.000 0 0 -1 0 0 2\n"; - printf O ("\t $xs %.0f $xe %.0f\n",$Yorig+50,$Yorig+50); - #print orizontal axes for PId (100%,75%,50%) - for ($percent_id = 50; $percent_id < 101; 
$percent_id += 25) { - print O "2 1 2 1 0 7 60 0 -1 4.000 0 0 -1 0 0 2\n"; - printf O ("\t$xs %.0f $xe %.0f\n",$Yorig+250+$YdistPID-($percent_id - 50) * $yscale,$Yorig+250+$YdistPID-($percent_id - 50) * $yscale); - #last if ($Mgaps); - } - #print orizontal markers for bp. - $increment=10000/$Xscale; - $no_incr=0; - $xmark = $xs ;$xmark_float= $xs; - while ($xmark < $xe){ - print O "2 1 0 1 0 7 60 0 -1 0.000 0 0 -1 0 0 2\n"; - printf O ("\t$xmark %.0f $xmark %.0f\n",$Yorig+$YdistPID+250,$Yorig+$YdistPID+300); - #bp scale - print O "4 0 0 100 0 0 8 0.0000 4 135 405 "; - printf O ("\t %.0f %.0f",$xmark,$Yorig+$YdistPID+400); - print O " $no_incr"."k", "\\001\n"; - $no_incr += 10; - $xmark_float += $increment; - $xmark=int($xmark_float); - } - - #coord for chr ends - print O "4 0 0 50 0 0 14 0.0000 4 135 450 $xs $Yorig $startcontg\\001\n"; - printf O ("4 0 0 50 0 0 14 0.0000 4 135 810 %.0f $Yorig $endcontg\\001\n",$xe-length($xe)*125); - - #print contig name# - if ($mref){ - print O "4 0 0 5 0 0 8 0.0000 4 135 405 "; - printf O ("\t%.0f %.0f ",$xs,$Yorig+70); - print O $a[-2], "\\001\n"; - } - #print vertical markers for PId scale - if (!$Mgaps){ - for ($percent_id = 50; $percent_id < 101; $percent_id += 25) { - #left - print O "4 0 0 100 0 0 8 0.0000 4 135 405 "; - printf O ("\t%.0f %.0f", $xs-200,$Yorig+$YdistPID+250-($percent_id - 50) * $yscale + 20); - print O " $percent_id%", "\\001\n"; - #right - print O "4 0 0 100 0 0 8 0.0000 4 135 405 "; - printf O ("\t%.0f %.0f",$xe+20, $Yorig+$YdistPID+250-($percent_id - 50) * $yscale+20 ); - print O " $percent_id%", "\\001\n"; - - # print the tick mark - #left - # print O "2 1 0 1 0 7 60 0 -1 0.000 0 0 -1 0 0 2\n"; - # printf O ("\t%.0f %.0f $xs %.0f\n",$xs-50, - # $Yorig+$YdistPID+250-($percent_id - 50) * $yscale, $Yorig+$YdistPID+250-($percent_id - 50) * $yscale); - #right - # print O "2 1 0 1 0 7 60 0 -1 0.000 0 0 -1 0 0 2\n"; - # printf O ("\t$xe %.0f %.0f %.0f\n", - # $Yorig+$YdistPID+250-($percent_id - 50) * 
$yscale,$xe+50, $Yorig+$YdistPID+250-($percent_id - 50) * $yscale); - } - } - else{ # for Mgaps - print O "4 0 0 100 0 0 7 1.5710 4 135 405 "; - printf O ("\t%.0f %.0f", $xs-50,$Yorig+$YdistPID+250 - 5 * $yscale + 10); - print O " + qry strand", "\\001\n"; - - print O "4 0 0 100 0 0 7 1.5710 4 135 405 "; - printf O ("\t%.0f %.0f", $xs-50,$Yorig+$YdistPID+250 - 30 * $yscale + 10); - print O " - qry strand", "\\001\n"; - } - -} -#********************************* -sub print_legend{ - - print O "4 0 0 100 0 0 8 0.0000 4 135 405 "; - printf O ("\t%.0f %.0f ",100,$Yorig+$YdistPID+1100); - print O " Legend ", "\\001\n"; - $y= $Yorig+$YdistPID+1300; #utr - print O "2 1 0 1 0 0 50 0 -1 0.000 0 0 -1 0 0 2\n";#intron - print O "\t 70 $y 99 $y\n"; - print O "2 1 0 5 27 0 50 0 -1 0.000 0 0 -1 0 0 2\n"; - print O "\t 100 $y 200 $y\n"; - print O "2 1 0 1 0 0 50 0 -1 0.000 0 0 -1 0 0 2\n";#intron - print O "\t 200 $y 230 $y\n"; - print O "4 0 0 100 0 0 8 0.0000 4 135 405 "; - printf O ("\t%.0f %.0f ",300,$y+30); - print O " 5' utr ", "\\001\n"; - $y += 150 ;#cds - print O "2 1 0 1 0 0 50 0 -1 0.000 0 0 -1 0 0 2\n";#intron - print O "\t 70 $y 99 $y\n"; - print O "2 1 0 5 2 0 50 0 -1 0.000 0 0 -1 0 0 2\n"; - print O "\t 100 $y 200 $y\n"; - print O "2 1 0 1 0 0 50 0 -1 0.000 0 0 -1 0 0 2\n";#intron - print O "\t 200 $y 230 $y\n"; - print O "4 0 0 100 0 0 8 0.0000 4 135 405 "; - printf O ("\t%.0f %.0f ",300,$y+30); - print O " cds ", "\\001\n"; - $y += 150; #3' utr - print O "2 1 0 1 0 0 50 0 -1 0.000 0 0 -1 0 0 2\n";#intron - print O "\t 70 $y 99 $y\n"; - print O "2 1 0 5 1 0 50 0 -1 0.000 0 0 -1 0 0 2\n"; - print O "\t 100 $y 200 $y\n"; - print O "2 1 0 1 0 0 50 0 -1 0.000 0 0 -1 0 0 2\n";#intron - print O "\t 200 $y 230 $y\n"; - print O "4 0 0 100 0 0 8 0.0000 4 135 405 "; - printf O ("\t%.0f %.0f ",300,$y+30); - print O " 3' utr ", "\\001\n"; - $y += 150; # match - print O "2 1 0 2 4 0 50 0 -1 0.000 0 0 -1 0 0 2\n"; - print O "\t100 $y 200 $y\n"; - print O "4 0 0 100 0 0 8 0.0000 
4 135 405 "; - printf O ("\t%.0f %.0f ",300,$y+30); - print O " match found by $prog ", "\\001\n"; -} -#********************************* -sub change_file_format{ - - $procfile =~ /^tmp(.+)/; - $outfile = $1.$nrf.".".$format; - $comand = "fig2dev -L $format -x 100"; - $comand .= " -m $magn" if ($magn); - $comand .= " -M ".$procfile.$nrf.".fig ".$outfile; - - $status =system($comand); - print E "ERROR 1: fig2dev !\n" unless $status == 0; - - $status =system("rm $procfile".$nrf.".fig"); - - if ($verb){ - print "$outfile"; - if ($mref){ - print "\n" ; - } - else { - print "\t range : $startcoord\t$endcoord \n" ; - } - } -} -#********************************* -sub format_mgaps{ -$tmpfile="tmpmgaps"; -$tmpfile2=$alignm."coords" ; -get_ref_len(); #print $maxlenref."\n"; - open(M,">".$tmpfile2) || die "can't open \" $tmpfile2 \" file !"; - print M "$alignm\n"; - print M "Mgaps\n\n"; - print M " [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]\n"; - print M "===============================================================================================================================\n"; - - open(T,">".$tmpfile) || die "can't open \" $tmpfile \" file !"; - - open(A,"<".$alignm) || die "can't open \" $alignm \" file !"; - - - while(<A>) { - chomp; - @a=split; - if ($a[0] =~ /^>/){ - $nr_cluster=1; - $idquery=$a[1]; - if ($a[2] eq "Reverse"){$idquery .= "_rev";} - } - #elsif ($a[0] eq "#") {$nr_cluster++;} - elsif($a[0] ne "#"){ - $e1=$a[0]+$a[2]; - print T $a[0]."\t".$e1."\t"."|"; - print T "\t-\t-\t|"; - print T "\t-\t-\t|"; - print T "\t-\t|"; #pid - print T "\t$maxlenref\t-\t|";#len seqs - # print "\t-\t-\t|"; - # print "\t-\t-\t|"; - # print T " $Mgaps\t$idquery.$nr_cluster\n"; - print T " $Mgaps\t$idquery\n"; - } - - } - close(A); - close(T); - $command="sort -n -k 1 $tmpfile >> $tmpfile2"; - $status =system($command); - system("rm $tmpfile"); - close(M); - $alignm=$tmpfile2; - print STDERR "ERROR 1: can't sort $tmpfile \n" 
unless $status == 0; - print STDERR "\n**************************************** \n"; - print STDERR "New input file created : $alignm\n"; - print STDERR "**************************************** \n\n"; - -} -#***************************** -sub get_ref_len{ - $firstrow=1; - open(A,"<".$alignm) || die "can't open \" $alignm \" file !"; - while(<A>) { - chomp; - @a=split; - if ($firstrow) { - $firstrow=0; - if ($a[0] !~ /^>/){ - print "\nWrong file format for MGAPS file : $alignm ! \n"; - exit; - } - } - - if ($a[0] =~ /^>/){ next; } - elsif($a[0] ne "#"){ - $e1=$a[0]+$a[2]; - $maxlenref=($maxlenref < $e1 ? $e1 : $maxlenref); - } - } - close(A); -} diff --git a/tools/MUMmer3.23/mgaps b/tools/MUMmer3.23/mgaps deleted file mode 100755 index ea8a34d..0000000 Binary files a/tools/MUMmer3.23/mgaps and /dev/null differ diff --git a/tools/MUMmer3.23/mummer b/tools/MUMmer3.23/mummer deleted file mode 100755 index 048726a..0000000 Binary files a/tools/MUMmer3.23/mummer and /dev/null differ diff --git a/tools/MUMmer3.23/mummerplot b/tools/MUMmer3.23/mummerplot deleted file mode 100755 index aa4436b..0000000 --- a/tools/MUMmer3.23/mummerplot +++ /dev/null @@ -1,1602 +0,0 @@ -#!/usr/bin/perl - -################################################################################ -# Programmer: Adam M Phillippy, The Institute for Genomic Research -# File: mummerplot -# Date: 01 / 08 / 03 -# 01 / 06 / 05 rewritten (v3.0) -# -# Usage: -# mummerplot [options] <match file> -# -# Try 'mummerplot -h' for more information. -# -# Purpose: To generate a gnuplot plot for the display of mummer, nucmer, -# promer, and show-tiling alignments. 
-# -################################################################################ - -use lib "/export/home/zqhu/tools/MUMmer3.23/scripts"; -use Foundation; -use strict; -use IO::Socket; - -my $BIN_DIR = "/export/home/zqhu/tools/MUMmer3.23"; -my $SCRIPT_DIR = "/export/home/zqhu/tools/MUMmer3.23/scripts"; - - -#================================================================= Globals ====# -#-- terminal types -my $X11 = "x11"; -my $PS = "postscript"; -my $PNG = "png"; - -#-- terminal sizes -my $SMALL = "small"; -my $MEDIUM = "medium"; -my $LARGE = "large"; - -my %TERMSIZE = - ( - $X11 => { $SMALL => 500, $MEDIUM => 700, $LARGE => 900 }, # screen pix - $PS => { $SMALL => 1, $MEDIUM => 2, $LARGE => 3 }, # pages - $PNG => { $SMALL => 800, $MEDIUM => 1024, $LARGE => 1400 } # image pix - ); - -#-- terminal format -my $FFACE = "Courier"; -my $FSIZE = "8"; -my $TFORMAT = "%.0f"; -my $MFORMAT = "[%.0f, %.0f]"; - -#-- output suffixes -my $FILTER = "filter"; -my $FWDPLOT = "fplot"; -my $REVPLOT = "rplot"; -my $HLTPLOT = "hplot"; -my $GNUPLOT = "gnuplot"; - -my %SUFFIX = - ( - $FILTER => ".filter", - $FWDPLOT => ".fplot", - $REVPLOT => ".rplot", - $HLTPLOT => ".hplot", - $GNUPLOT => ".gp", - $PS => ".ps", - $PNG => ".png" - ); - - -#================================================================= Options ====# -my $OPT_breaklen; # -b option -my $OPT_color; # --[no]color option -my $OPT_coverage; # --[no]coverage option -my $OPT_filter; # -f option -my $OPT_layout; # -l option -my $OPT_prefix = "out"; # -p option -my $OPT_rv; # --rv option -my $OPT_terminal = $X11; # -t option -my $OPT_IdR; # -r option -my $OPT_IdQ; # -q option -my $OPT_IDRfile; # -R option -my $OPT_IDQfile; # -Q option -my $OPT_rport; # -rport option -my $OPT_qport; # -qport option -my $OPT_size = $SMALL; # -small, -medium, -large -my $OPT_SNP; # -S option -my $OPT_xrange; # -x option -my $OPT_yrange; # -y option -my $OPT_title; # -title option - -my $OPT_Mfile; # match file -my $OPT_Dfile; # delta filter 
file -my $OPT_Ffile; # .fplot output -my $OPT_Rfile; # .rplot output -my $OPT_Hfile; # .hplot output -my $OPT_Gfile; # .gp output -my $OPT_Pfile; # .ps .png output - -my $OPT_gpstatus; # gnuplot status - -my $OPT_ONLY_USE_FATTEST; # Only use fattest alignment for layout - - -#============================================================== Foundation ====# -my $VERSION = '3.5'; - -my $USAGE = qq~ - USAGE: mummerplot [options] <match file> - ~; - -my $HELP = qq~ - USAGE: mummerplot [options] <match file> - - DESCRIPTION: - mummerplot generates plots of alignment data produced by mummer, nucmer, - promer or show-tiling by using the GNU gnuplot utility. After generating - the appropriate scripts and datafiles, mummerplot will attempt to run - gnuplot to generate the plot. If this attempt fails, a warning will be - output and the resulting .gp and .[frh]plot files will remain so that the - user may run gnuplot independently. If the attempt succeeds, either an x11 - window will be spawned or an additional output file will be generated - (.ps or .png depending on the selected terminal). Feel free to edit the - resulting gnuplot script (.gp) and rerun gnuplot to change line thinkness, - labels, colors, plot size etc. - - MANDATORY: - match file Set the alignment input to 'match file' - Valid inputs are from mummer, nucmer, promer and - show-tiling (.out, .cluster, .delta and .tiling) - - OPTIONS: - -b|breaklen Highlight alignments with breakpoints further than - breaklen nucleotides from the nearest sequence end - --[no]color Color plot lines with a percent similarity gradient or - turn off all plot color (default color by match dir) - If the plot is very sparse, edit the .gp script to plot - with 'linespoints' instead of 'lines' - -c - --[no]coverage Generate a reference coverage plot (default for .tiling) - --depend Print the dependency information and exit - -f - --filter Only display .delta alignments which represent the "best" - hit to any particular spot on either sequence, i.e. 
a - one-to-one mapping of reference and query subsequences - -h - --help Display help information and exit - -l - --layout Layout a .delta multiplot in an intelligible fashion, - this option requires the -R -Q options - --fat Layout sequences using fattest alignment only - -p|prefix Set the prefix of the output files (default '$OPT_prefix') - -rv Reverse video for x11 plots - -r|IdR Plot a particular reference sequence ID on the X-axis - -q|IdQ Plot a particular query sequence ID on the Y-axis - -R|Rfile Plot an ordered set of reference sequences from Rfile - -Q|Qfile Plot an ordered set of query sequences from Qfile - Rfile/Qfile Can either be the original DNA multi-FastA - files or lists of sequence IDs, lens and dirs [ /+/-] - -r|rport Specify the port to send reference ID and position on - mouse double click in X11 plot window - -q|qport Specify the port to send query IDs and position on mouse - double click in X11 plot window - -s|size Set the output size to small, medium or large - --small --medium --large (default '$OPT_size') - -S - --SNP Highlight SNP locations in each alignment - -t|terminal Set the output terminal to x11, postscript or png - --x11 --postscript --png (default '$OPT_terminal') - -t|title Specify the gnuplot plot title (default none) - -x|xrange Set the xrange for the plot '[min:max]' - -y|yrange Set the yrange for the plot '[min:max]' - -V - --version Display the version information and exit - ~; - -my @DEPEND = - ( - "$SCRIPT_DIR/Foundation.pm", - "$BIN_DIR/delta-filter", - "$BIN_DIR/show-coords", - "$BIN_DIR/show-snps", - "gnuplot" - ); - -my $tigr = new TIGR::Foundation - or die "ERROR: TIGR::Foundation could not be initialized\n"; - -$tigr -> setVersionInfo ($VERSION); -$tigr -> setUsageInfo ($USAGE); -$tigr -> setHelpInfo ($HELP); -$tigr -> addDependInfo (@DEPEND); - - -#=========================================================== Function Decs ====# -sub GetParseFunc( ); - -sub ParseIDs($$); - -sub ParseDelta($); -sub ParseCluster($); 
-sub ParseMummer($); -sub ParseTiling($); - -sub LayoutIDs($$); -sub SpanXwY ($$$$$); - -sub PlotData($$$); -sub WriteGP($$); -sub RunGP( ); -sub ListenGP($$); - -sub ParseOptions( ); - - -#=========================================================== Function Defs ====# -MAIN: -{ - my @aligns; # (sR eR sQ eQ sim lenR lenQ idR idQ) - my %refs; # (id => (off, len, [1/-1])) - my %qrys; # (id => (off, len, [1/-1])) - - #-- Get the command line options (sets OPT_ global vars) - ParseOptions( ); - - - #-- Get the alignment type - my $parsefunc = GetParseFunc( ); - - if ( $parsefunc != \&ParseDelta && - ($OPT_filter || $OPT_layout || $OPT_SNP) ) { - print STDERR "WARNING: -f -l -S only work with delta input\n"; - undef $OPT_filter; - undef $OPT_layout; - undef $OPT_SNP; - } - - #-- Parse the reference and query IDs - if ( defined $OPT_IdR ) { $refs{$OPT_IdR} = [ 0, 0, 1 ]; } - elsif ( defined $OPT_IDRfile ) { - ParseIDs ($OPT_IDRfile, \%refs); - } - - if ( defined $OPT_IdQ ) { $qrys{$OPT_IdQ} = [ 0, 0, 1 ]; } - elsif ( defined $OPT_IDQfile ) { - ParseIDs ($OPT_IDQfile, \%qrys); - } - - - #-- Filter the alignments - if ( $OPT_filter || $OPT_layout ) { - print STDERR "Writing filtered delta file $OPT_Dfile\n"; - system ("$BIN_DIR/delta-filter -r -q $OPT_Mfile > $OPT_Dfile") - and die "ERROR: Could not run delta-filter, $!\n"; - if ( $OPT_filter ) { $OPT_Mfile = $OPT_Dfile; } - } - - - #-- Parse the alignment data - $parsefunc->(\@aligns); - - - #-- Layout the alignment data if requested - if ( $OPT_layout ) { - if ( scalar (keys %refs) || scalar (keys %qrys) ) { - LayoutIDs (\%refs, \%qrys); - } - else { - print STDERR "WARNING: --layout option only works with -R or -Q\n"; - undef $OPT_layout; - } - } - - - #-- Plot the alignment data - PlotData (\@aligns, \%refs, \%qrys); - - - #-- Write the gnuplot script - WriteGP (\%refs, \%qrys); - - - #-- Run gnuplot script and fork a clipboard listener - unless ( $OPT_gpstatus == -1 ) { - - my $child = 1; - if ( $OPT_gpstatus == 0 && 
$OPT_terminal eq $X11 ) { - print STDERR "Forking mouse listener\n"; - $child = fork; - } - - #-- parent runs gnuplot - if ( $child ) { - RunGP( ); - kill 1, $child; - } - #-- child listens to clipboard - elsif ( defined $child ) { - ListenGP(\%refs, \%qrys); - } - else { - print STDERR "WARNING: Could not fork mouse listener\n"; - } - } - - exit (0); -} - - -#------------------------------------------------------------ GetParseFunc ----# -sub GetParseFunc ( ) -{ - my $fref; - - open (MFILE, "<$OPT_Mfile") - or die "ERROR: Could not open $OPT_Mfile, $!\n"; - - $_ = <MFILE>; - if ( !defined ) { die "ERROR: Could not read $OPT_Mfile, File is empty\n" } - - SWITCH: { - #-- tiling - if ( /^>\S+ \d+ bases/ ) { - $fref = \&ParseTiling; - last SWITCH; - } - - #-- mummer - if ( /^> \S+/ ) { - $fref = \&ParseMummer; - last SWITCH; - } - - #-- nucmer/promer - if ( /^(\S+) (\S+)/ ) { - if ( ! defined $OPT_IDRfile ) { - $OPT_IDRfile = $1; - } - if ( ! defined $OPT_IDQfile ) { - $OPT_IDQfile = $2; - } - - $_ = <MFILE>; - if ( (defined) && (/^NUCMER$/ || /^PROMER$/) ) { - $_ = <MFILE>; # sequence header - $_ = <MFILE>; # alignment header - if ( !defined ) { - $fref = \&ParseDelta; - last SWITCH; - } - elsif ( /^\d+ \d+ \d+ \d+ \d+ \d+ \d+$/ ) { - $fref = \&ParseDelta; - last SWITCH; - } - elsif ( /^[ \-][1-3] [ \-][1-3]$/ ) { - $fref = \&ParseCluster; - last SWITCH; - } - } - } - - #-- default - die "ERROR: Could not read $OPT_Mfile, Unrecognized file type\n"; - } - - close (MFILE) - or print STDERR "WARNING: Trouble closing $OPT_Mfile, $!\n"; - - return $fref; -} - - -#---------------------------------------------------------------- ParseIDs ----# -sub ParseIDs ($$) -{ - my $file = shift; - my $href = shift; - - open (IDFILE, "<$file") - or print STDERR "WARNING: Could not open $file, $!\n"; - - my $dir; - my $aref; - my $isfasta; - my $offset = 0; - while ( <IDFILE> ) { - #-- Ignore blank lines - if ( /^\s*$/ ) { next; } - - #-- FastA header - if ( /^>(\S+)/ ) { - if ( exists $href->{$1} ) { - print STDERR 
"WARNING: Duplicate sequence '$1' ignored\n"; - undef $aref; - next; - } - - if ( !$isfasta ) { $isfasta = 1; } - if ( defined $aref ) { $offset += $aref->[1] - 1; } - - $aref = [ $offset, 0, 1 ]; - $href->{$1} = $aref; - next; - } - - #-- FastA sequence - if ( $isfasta && /^\S+$/ ) { - if ( defined $aref ) { $aref->[1] += (length) - 1; } - next; - } - - #-- ID len dir - if ( !$isfasta && /^(\S+)\s+(\d+)\s+([+-]?)$/ ) { - if ( exists $href->{$1} ) { - print STDERR "WARNING: Duplicate sequence '$1' ignored\n"; - undef $aref; - next; - } - - $dir = (defined $3 && $3 eq "-") ? -1 : 1; - $aref = [ $offset, $2, $dir ]; - $offset += $2 - 1; - $href->{$1} = $aref; - next; - } - - #-- default - print STDERR "WARNING: Could not parse $file\n$_"; - undef %$href; - last; - } - - close (IDFILE) - or print STDERR "WARNING: Trouble closing $file, $!\n"; -} - - -#-------------------------------------------------------------- ParseDelta ----# -sub ParseDelta ($) -{ - my $aref = shift; - - print STDERR "Reading delta file $OPT_Mfile\n"; - - open (MFILE, "<$OPT_Mfile") - or die "ERROR: Could not open $OPT_Mfile, $!\n"; - - my @align; - my $ispromer; - my ($sim, $tot); - my ($lenR, $lenQ, $idR, $idQ); - - $_ = ; - $_ = ; - $ispromer = /^PROMER/; - - while ( ) { - #-- delta int - if ( /^([-]?\d+)$/ ) { - if ( $1 < 0 ) { - $tot ++; - } - elsif ( $1 == 0 ) { - $align[4] = ($tot - $sim) / $tot * 100.0; - push @$aref, [ @align ]; - $tot = 0; - } - next; - } - - #-- alignment header - if ( /^(\d+) (\d+) (\d+) (\d+) \d+ (\d+) \d+$/ ) { - if ( $tot == 0 ) { - @align = ($1, $2, $3, $4, 0, $lenR, $lenQ, $idR, $idQ); - $tot = abs($1 - $2) + 1; - $sim = $5; - if ( $ispromer ) { $tot /= 3.0; } - next; - } - #-- drop to default - } - - #-- sequence header - if ( /^>(\S+) (\S+) (\d+) (\d+)$/ ) { - ($idR, $idQ, $lenR, $lenQ) = ($1, $2, $3, $4); - $tot = 0; - next; - } - - #-- default - die "ERROR: Could not parse $OPT_Mfile\n$_"; - } - - close (MFILE) - or print STDERR "WARNING: Trouble closing 
$OPT_Mfile, $!\n"; -} - - -#------------------------------------------------------------ ParseCluster ----# -sub ParseCluster ($) -{ - my $aref = shift; - - print STDERR "Reading cluster file $OPT_Mfile\n"; - - open (MFILE, "<$OPT_Mfile") - or die "ERROR: Could not open $OPT_Mfile, $!\n"; - - my @align; - my ($dR, $dQ, $len); - my ($lenR, $lenQ, $idR, $idQ); - - $_ = <MFILE>; - $_ = <MFILE>; - - while ( <MFILE> ) { - #-- match - if ( /^\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+\S+$/ ) { - @align = ($1, $1, $2, $2, 100, $lenR, $lenQ, $idR, $idQ); - $len = $3 - 1; - $align[1] += $dR == 1 ? $len : -$len; - $align[3] += $dQ == 1 ? $len : -$len; - push @$aref, [ @align ]; - next; - } - - #-- cluster header - if ( /^[ \-][1-3] [ \-][1-3]$/ ) { - $dR = /^-/ ? -1 : 1; - $dQ = /-[1-3]$/ ? -1 : 1; - next; - } - - #-- sequence header - if ( /^>(\S+) (\S+) (\d+) (\d+)$/ ) { - ($idR, $idQ, $lenR, $lenQ) = ($1, $2, $3, $4); - next; - } - - #-- default - die "ERROR: Could not parse $OPT_Mfile\n$_"; - } - - close (MFILE) - or print STDERR "WARNING: Trouble closing $OPT_Mfile, $!\n"; -} - - -#------------------------------------------------------------- ParseMummer ----# -sub ParseMummer ($) -{ - my $aref = shift; - - print STDERR "Reading mummer file $OPT_Mfile (use mummer -c)\n"; - - open (MFILE, "<$OPT_Mfile") - or die "ERROR: Could not open $OPT_Mfile, $!\n"; - - my @align; - my ($dQ, $len); - my ($lenQ, $idQ); - - while ( <MFILE> ) { - #-- 3 column match - if ( /^\s+(\d+)\s+(\d+)\s+(\d+)$/ ) { - @align = ($1, $1, $2, $2, 100, 0, $lenQ, "REF", $idQ); - $len = $3 - 1; - $align[1] += $len; - $align[3] += $dQ == 1 ? $len : -$len; - push @$aref, [ @align ]; - next; - } - - #-- 4 column match - if ( /^\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)$/ ) { - @align = ($2, $2, $3, $3, 100, 0, $lenQ, $1, $idQ); - $len = $4 - 1; - $align[1] += $len; - $align[3] += $dQ == 1 ? $len : -$len; - push @$aref, [ @align ]; - next; - } - - #-- sequence header - if ( /^> (\S+)/ ) { - $idQ = $1; - $dQ = /^> \S+ Reverse/ ? 
-1 : 1; - $lenQ = /Len = (\d+)/ ? $1 : 0; - next; - } - - #-- default - die "ERROR: Could not parse $OPT_Mfile\n$_"; - } - - close (MFILE) - or print STDERR "WARNING: Trouble closing $OPT_Mfile, $!\n"; -} - - -#------------------------------------------------------------- ParseTiling ----# -sub ParseTiling ($) -{ - my $aref = shift; - - print STDERR "Reading tiling file $OPT_Mfile\n"; - - open (MFILE, "<$OPT_Mfile") - or die "ERROR: Could not open $OPT_Mfile, $!\n"; - - my @align; - my ($dR, $dQ, $len); - my ($lenR, $lenQ, $idR, $idQ); - - while ( <MFILE> ) { - #-- tile - if ( /^(\S+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+(\S+)\s+([+-])\s+(\S+)$/ ) { - @align = ($1, $1, 1, 1, $3, $lenR, $2, $idR, $5); - $len = $2 - 1; - $align[1] += $len; - $align[($4 eq "-" ? 2 : 3)] += $len; - push @$aref, [ @align ]; - next; - } - - #-- sequence header - if ( /^>(\S+) (\d+) bases$/ ) { - ($idR, $lenR) = ($1, $2); - next; - } - - #-- default - die "ERROR: Could not parse $OPT_Mfile\n$_"; - } - - close (MFILE) - or print STDERR "WARNING: Trouble closing $OPT_Mfile, $!\n"; - - if ( ! defined $OPT_coverage ) { $OPT_coverage = 1; } -} - - -#--------------------------------------------------------------- LayoutIDs ----# -# For each reference and query sequence, find the set of alignments that -# produce the heaviest (both in non-redundant coverage and percent -# identity) alignment subset of each sequence using a modified version -# of the longest increasing subset algorithm. Let R be the union of all -# reference LIS subsets, and Q be the union of all query LIS -# subsets. Let S be the intersection of R and Q. Using this LIS subset, -# recursively span reference and query sequences by their smaller -# counterparts until all spanning sequences have been placed. The goal -# is to cluster all the "major" alignment information along the main -# diagonal for easy viewing and interpretation. 
-sub LayoutIDs ($$) -{ - my $rref = shift; - my $qref = shift; - - my %rc; # chains of qry seqs needed to span each ref - my %qc; # chains of ref seqs needed to span each qry - # {idR} -> [ placed, len, {idQ} -> [ \slope, \loR, \hiR, \loQ, \hiQ ] ] - # {idQ} -> [ placed, len, {idR} -> [ \slope, \loQ, \hiQ, \loR, \hiR ] ] - - my @rl; # oo of ref seqs - my @ql; # oo of qry seqs - # [ [idR, slope] ] - # [ [idQ, slope] ] - - #-- get the filtered alignments - open (BTAB, "$BIN_DIR/show-coords -B $OPT_Dfile |") - or die "ERROR: Could not open show-coords pipe, $!\n"; - - my @align; - my ($sR, $eR, $sQ, $eQ, $lenR, $lenQ, $idR, $idQ); - my ($loR, $hiR, $loQ, $hiQ); - my ($dR, $dQ, $slope); - while ( <BTAB> ) { - chomp; - @align = split "\t"; - if ( scalar @align != 21 ) { - die "ERROR: Could not read show-coords pipe, invalid btab format\n"; - } - - $sR = $align[8]; $eR = $align[9]; - $sQ = $align[6]; $eQ = $align[7]; - $lenR = $align[18]; $lenQ = $align[2]; - $idR = $align[5]; $idQ = $align[0]; - - #-- skip it if not on include list - if ( !exists $rref->{$idR} || !exists $qref->{$idQ} ) { next; } - - #-- get orientation of both alignments and alignment slope - $dR = $sR < $eR ? 1 : -1; - $dQ = $sQ < $eQ ? 1 : -1; - $slope = $dR == $dQ ? 1 : -1; - - #-- get lo's and hi's - $loR = $dR == 1 ? $sR : $eR; - $hiR = $dR == 1 ? $eR : $sR; - - $loQ = $dQ == 1 ? $sQ : $eQ; - $hiQ = $dQ == 1 ? 
$eQ : $sQ; - - if ($OPT_ONLY_USE_FATTEST) - { - #-- Check to see if there is another better alignment - if (exists $qc{$idQ}) - { - my ($oldR) = keys %{$qc{$idQ}[2]}; - my $val = $qc{$idQ}[2]{$oldR}; - - if (${$val->[4]} - ${$val->[3]} > $hiR - $loR) - { - #-- Old alignment is better, skip this one - next; - } - else - { - #-- This alignment is better, prune old alignment - delete $rc{$oldR}[2]{$idQ}; - delete $qc{$idQ}; - } - } - } - - #-- initialize - if ( !exists $rc{$idR} ) { $rc{$idR} = [ 0, $lenR, { } ]; } - if ( !exists $qc{$idQ} ) { $qc{$idQ} = [ 0, $lenQ, { } ]; } - - #-- if no alignments for these two exist OR - #-- this alignment is bigger than the current - if ( !exists $rc{$idR}[2]{$idQ} || !exists $qc{$idQ}[2]{$idR} || - $hiR - $loR > - ${$rc{$idR}[2]{$idQ}[2]} - ${$rc{$idR}[2]{$idQ}[1]} ) { - - #-- rc and qc reference these anonymous values - my $aref = [ $slope, $loR, $hiR, $loQ, $hiQ ]; - - #-- rc is ordered [ slope, loR, hiR, loQ, hiQ ] - #-- qc is ordered [ slope, loQ, hiQ, loR, hiR ] - $rc{$idR}[2]{$idQ}[0] = $qc{$idQ}[2]{$idR}[0] = \$aref->[0]; - $rc{$idR}[2]{$idQ}[1] = $qc{$idQ}[2]{$idR}[3] = \$aref->[1]; - $rc{$idR}[2]{$idQ}[2] = $qc{$idQ}[2]{$idR}[4] = \$aref->[2]; - $rc{$idR}[2]{$idQ}[3] = $qc{$idQ}[2]{$idR}[1] = \$aref->[3]; - $rc{$idR}[2]{$idQ}[4] = $qc{$idQ}[2]{$idR}[2] = \$aref->[4]; - } - } - - close (BTAB) - or print STDERR "WARNING: Trouble closing show-coords pipe, $!\n"; - - #-- recursively span sequences to generate the layout - foreach $idR ( sort { $rc{$b}[1] <=> $rc{$a}[1] } keys %rc ) { - SpanXwY ($idR, \%rc, \@rl, \%qc, \@ql); - } - - #-- undefine the current offsets - foreach $idR ( keys %{$rref} ) { undef $rref->{$idR}[0]; } - foreach $idQ ( keys %{$qref} ) { undef $qref->{$idQ}[0]; } - - #-- redefine the offsets according to the new layout - my $roff = 0; - foreach my $r ( @rl ) { - $idR = $r->[0]; - $rref->{$idR}[0] = $roff; - $rref->{$idR}[2] = $r->[1]; - $roff += $rref->{$idR}[1] - 1; - } - #-- append the guys left out 
of the layout - foreach $idR ( keys %{$rref} ) { - if ( !defined $rref->{$idR}[0] ) { - $rref->{$idR}[0] = $roff; - $roff += $rref->{$idR}[1] - 1; - } - } - - #-- redefine the offsets according to the new layout - my $qoff = 0; - foreach my $q ( @ql ) { - $idQ = $q->[0]; - $qref->{$idQ}[0] = $qoff; - $qref->{$idQ}[2] = $q->[1]; - $qoff += $qref->{$idQ}[1] - 1; - } - #-- append the guys left out of the layout - foreach $idQ ( keys %{$qref} ) { - if ( !defined $qref->{$idQ}[0] ) { - $qref->{$idQ}[0] = $qoff; - $qoff += $qref->{$idQ}[1] - 1; - } - } -} - - -#----------------------------------------------------------------- SpanXwY ----# -sub SpanXwY ($$$$$) { - my $x = shift; # idX - my $xcr = shift; # xc ref - my $xlr = shift; # xl ref - my $ycr = shift; # yc ref - my $ylr = shift; # yl ref - - my @post; - foreach my $y ( sort { ${$xcr->{$x}[2]{$a}[1]} <=> ${$xcr->{$x}[2]{$b}[1]} } - keys %{$xcr->{$x}[2]} ) { - - #-- skip if already placed (RECURSION BASE) - if ( $ycr->{$y}[0] ) { next; } - else { $ycr->{$y}[0] = 1; } - - #-- get len and slope info for y - my $len = $ycr->{$y}[1]; - my $slope = ${$xcr->{$x}[2]{$y}[0]}; - - #-- if we need to flip, reverse complement all y records - if ( $slope == -1 ) { - foreach my $xx ( keys %{$ycr->{$y}[2]} ) { - ${$ycr->{$y}[2]{$xx}[0]} *= -1; - - my $loy = ${$ycr->{$y}[2]{$xx}[1]}; - my $hiy = ${$ycr->{$y}[2]{$xx}[2]}; - ${$ycr->{$y}[2]{$xx}[1]} = $len - $hiy + 1; - ${$ycr->{$y}[2]{$xx}[2]} = $len - $loy + 1; - } - } - - #-- place y - push @{$ylr}, [ $y, $slope ]; - - #-- RECURSE if y > x, else save for later - if ( $len > $xcr->{$x}[1] ) { SpanXwY ($y, $ycr, $ylr, $xcr, $xlr); } - else { push @post, $y; } - } - - #-- RECURSE for all y < x - foreach my $y ( @post ) { SpanXwY ($y, $ycr, $ylr, $xcr, $xlr); } -} - - -#---------------------------------------------------------------- PlotData ----# -sub PlotData ($$$) -{ - my $aref = shift; - my $rref = shift; - my $qref = shift; - - print STDERR "Writing plot files $OPT_Ffile, 
$OPT_Rfile", - (defined $OPT_Hfile ? ", $OPT_Hfile\n" : "\n"); - - open (FFILE, ">$OPT_Ffile") - or die "ERROR: Could not open $OPT_Ffile, $!\n"; - print FFILE "#-- forward hits sorted by %sim\n0 0 0\n0 0 0\n\n\n"; - - open (RFILE, ">$OPT_Rfile") - or die "ERROR: Could not open $OPT_Rfile, $!\n"; - print RFILE "#-- reverse hits sorted by %sim\n0 0 0\n0 0 0\n\n\n"; - - if ( defined $OPT_Hfile ) { - open (HFILE, ">$OPT_Hfile") - or die "ERROR: Could not open $OPT_Hfile, $!\n"; - print HFILE "#-- highlighted hits sorted by %sim\n0 0 0\n0 0 0\n\n\n"; - } - - my $fh; - my $align; - my $isplotted; - my $ismultiref; - my $ismultiqry; - my ($plenR, $plenQ, $pidR, $pidQ); - - #-- for each alignment sorted by ascending identity - foreach $align ( sort { $a->[4] <=> $b->[4] } @$aref ) { - - my ($sR, $eR, $sQ, $eQ, $sim, $lenR, $lenQ, $idR, $idQ) = @$align; - - if ( ! defined $pidR ) { - ($plenR, $plenQ, $pidR, $pidQ) = ($lenR, $lenQ, $idR, $idQ); - } - - #-- set the sequence offset, length, direction, etc... 
- my ($refoff, $reflen, $refdir); - my ($qryoff, $qrylen, $qrydir); - - if ( defined (%$rref) ) { - #-- skip reference sequence or set atts from hash - if ( !exists ($rref->{$idR}) ) { next; } - else { ($refoff, $reflen, $refdir) = @{$rref->{$idR}}; } - } - else { - #-- no reference hash, so default atts - ($refoff, $reflen, $refdir) = (0, $lenR, 1); - } - - if ( defined (%$qref) ) { - #-- skip query sequence or set atts from hash - if ( !exists ($qref->{$idQ}) ) { next; } - else { ($qryoff, $qrylen, $qrydir) = @{$qref->{$idQ}}; } - } - else { - #-- no query hash, so default atts - ($qryoff, $qrylen, $qrydir) = (0, $lenQ, 1); - } - - #-- get the orientation right - if ( $refdir == -1 ) { - $sR = $reflen - $sR + 1; - $eR = $reflen - $eR + 1; - } - if ( $qrydir == -1 ) { - $sQ = $qrylen - $sQ + 1; - $eQ = $qrylen - $eQ + 1; - } - - #-- forward file, reverse file, highlight file - my @fha; - - if ( defined $OPT_breaklen && - ( ($sR - 1 > $OPT_breaklen && - $sQ - 1 > $OPT_breaklen && - $reflen - $sR > $OPT_breaklen && - $qrylen - $sQ > $OPT_breaklen) - || - ($eR - 1 > $OPT_breaklen && - $eQ - 1 > $OPT_breaklen && - $reflen - $eR > $OPT_breaklen && - $qrylen - $eQ > $OPT_breaklen) ) ) { - push @fha, \*HFILE; - } - - push @fha, (($sR < $eR) == ($sQ < $eQ) ? 
\*FFILE : \*RFILE); - - #-- plot it - $sR += $refoff; $eR += $refoff; - $sQ += $qryoff; $eQ += $qryoff; - - if ( $OPT_coverage ) { - foreach $fh ( @fha ) { - print $fh - "$sR 10 $sim\n", "$eR 10 $sim\n\n\n", - "$sR $sim 0\n", "$eR $sim 0\n\n\n"; - } - } - else { - foreach $fh ( @fha ) { - print $fh "$sR $sQ $sim\n", "$eR $eQ $sim\n\n\n"; - } - } - - #-- set some flags - if ( !$ismultiref && $idR ne $pidR ) { $ismultiref = 1; } - if ( !$ismultiqry && $idQ ne $pidQ ) { $ismultiqry = 1; } - if ( !$isplotted ) { $isplotted = 1; } - } - - - #-- highlight the SNPs - if ( defined $OPT_SNP ) { - - print STDERR "Determining SNPs from sequence and alignment data\n"; - - open (SNPS, "$BIN_DIR/show-snps -H -T -l $OPT_Mfile |") - or die "ERROR: Could not open show-snps pipe, $!\n"; - - my @snps; - my ($pR, $pQ, $lenR, $lenQ, $idR, $idQ); - while ( <SNPS> ) { - chomp; - @snps = split "\t"; - if ( scalar @snps != 14 ) { - die "ERROR: Could not read show-snps pipe, invalid format\n"; - } - - $pR = $snps[0]; $pQ = $snps[3]; - $lenR = $snps[8]; $lenQ = $snps[9]; - $idR = $snps[12]; $idQ = $snps[13]; - - #-- set the sequence offset, length, direction, etc...
- my ($refoff, $reflen, $refdir); - my ($qryoff, $qrylen, $qrydir); - - if ( defined (%$rref) ) { - #-- skip reference sequence or set atts from hash - if ( !exists ($rref->{$idR}) ) { next; } - else { ($refoff, $reflen, $refdir) = @{$rref->{$idR}}; } - } - else { - #-- no reference hash, so default atts - ($refoff, $reflen, $refdir) = (0, $lenR, 1); - } - - if ( defined (%$qref) ) { - #-- skip query sequence or set atts from hash - if ( !exists ($qref->{$idQ}) ) { next; } - else { ($qryoff, $qrylen, $qrydir) = @{$qref->{$idQ}}; } - } - else { - #-- no query hash, so default atts - ($qryoff, $qrylen, $qrydir) = (0, $lenQ, 1); - } - - #-- get the orientation right - if ( $refdir == -1 ) { $pR = $reflen - $pR + 1; } - if ( $qrydir == -1 ) { $pQ = $qrylen - $pQ + 1; } - - #-- plot it - $pR += $refoff; - $pQ += $qryoff; - - if ( $OPT_coverage ) { - print HFILE "$pR 10 0\n", "$pR 10 0\n\n\n", - } - else { - print HFILE "$pR $pQ 0\n", "$pR $pQ 0\n\n\n"; - } - } - - close (SNPS) - or print STDERR "WARNING: Trouble closing show-snps pipe, $!\n"; - } - - - close (FFILE) - or print STDERR "WARNING: Trouble closing $OPT_Ffile, $!\n"; - - close (RFILE) - or print STDERR "WARNING: Trouble closing $OPT_Rfile, $!\n"; - - if ( defined $OPT_Hfile ) { - close (HFILE) - or print STDERR "WARNING: Trouble closing $OPT_Hfile, $!\n"; - } - - - if ( !defined (%$rref) ) { - if ( $ismultiref ) { - print STDERR - "WARNING: Multiple ref sequences overlaid, try -R or -r\n"; - } - elsif ( defined $pidR ) { - $rref->{$pidR} = [ 0, $plenR, 1 ]; - } - } - - if ( !defined (%$qref) ) { - if ( $ismultiqry && !$OPT_coverage ) { - print STDERR - "WARNING: Multiple qry sequences overlaid, try -Q, -q or -c\n"; - } - elsif ( defined $pidQ ) { - $qref->{$pidQ} = [ 0, $plenQ, 1 ]; - } - } - - if ( !$isplotted ) { - die "ERROR: No alignment data to plot\n"; - } -} - - -#----------------------------------------------------------------- WriteGP ----# -sub WriteGP ($$) -{ - my $rref = shift; - my $qref = shift; 
- - print STDERR "Writing gnuplot script $OPT_Gfile\n"; - - open (GFILE, ">$OPT_Gfile") - or die "ERROR: Could not open $OPT_Gfile, $!\n"; - - my ($FWD, $REV, $HLT) = (1, 2, 3); - my $SIZE = $TERMSIZE{$OPT_terminal}{$OPT_size}; - - #-- terminal specific stuff - my ($P_TERM, $P_SIZE, %P_PS, %P_LW); - foreach ( $OPT_terminal ) { - /^$X11/ and do { - $P_TERM = $OPT_gpstatus == 0 ? - "$X11 font \"$FFACE,$FSIZE\"" : "$X11"; - - %P_PS = ( $FWD => 1.0, $REV => 1.0, $HLT => 1.0 ); - - %P_LW = $OPT_coverage || $OPT_color ? - ( $FWD => 3.0, $REV => 3.0, $HLT => 3.0 ) : - ( $FWD => 2.0, $REV => 2.0, $HLT => 2.0 ); - - $P_SIZE = $OPT_coverage ? - "set size 1,1" : - "set size 1,1"; - - last; - }; - - /^$PS/ and do { - $P_TERM = defined $OPT_color && $OPT_color == 0 ? - "$PS monochrome" : "$PS color"; - $P_TERM .= $OPT_gpstatus == 0 ? - " solid \"$FFACE\" $FSIZE" : " solid \"$FFACE\" $FSIZE"; - - %P_PS = ( $FWD => 0.5, $REV => 0.5, $HLT => 0.5 ); - - %P_LW = $OPT_coverage || $OPT_color ? - ( $FWD => 4.0, $REV => 4.0, $HLT => 4.0 ) : - ( $FWD => 2.0, $REV => 2.0, $HLT => 2.0 ); - - $P_SIZE = $OPT_coverage ? - "set size ".(1.0 * $SIZE).",".(0.5 * $SIZE) : - "set size ".(1.0 * $SIZE).",".(1.0 * $SIZE); - - last; - }; - - /^$PNG/ and do { - $P_TERM = $OPT_gpstatus == 0 ? - "$PNG tiny size $SIZE,$SIZE" : "$PNG small"; - if ( defined $OPT_color && $OPT_color == 0 ) { - $P_TERM .= " xffffff x000000 x000000"; - $P_TERM .= " x000000 x000000 x000000"; - $P_TERM .= " x000000 x000000 x000000"; - } - - %P_PS = ( $FWD => 1.0, $REV => 1.0, $HLT => 1.0 ); - - %P_LW = $OPT_coverage || $OPT_color ? - ( $FWD => 3.0, $REV => 3.0, $HLT => 3.0 ) : - ( $FWD => 3.0, $REV => 3.0, $HLT => 3.0 ); - - $P_SIZE = $OPT_coverage ? 
- "set size 1,.375" : - "set size 1,1"; - - last; - }; - - die "ERROR: Don't know how to initialize terminal, $OPT_terminal\n"; - } - - #-- plot commands - my ($P_WITH, $P_FORMAT, $P_LS, $P_KEY, %P_PT, %P_LT); - - %P_PT = ( $FWD => 6, $REV => 6, $HLT => 6 ); - %P_LT = defined $OPT_Hfile ? - ( $FWD => 2, $REV => 2, $HLT => 1 ) : - ( $FWD => 1, $REV => 3, $HLT => 2 ); - - $P_WITH = $OPT_coverage || $OPT_color ? "w l" : "w lp"; - - $P_FORMAT = "set format \"$TFORMAT\""; - if ( $OPT_gpstatus == 0 ) { - $P_LS = "set style line"; - $P_KEY = "unset key"; - $P_FORMAT .= "\nset mouse format \"$TFORMAT\""; - $P_FORMAT .= "\nset mouse mouseformat \"$MFORMAT\""; - $P_FORMAT .= "\nset mouse clipboardformat \"$MFORMAT\""; - } - else { - $P_LS = "set linestyle"; - $P_KEY = "set nokey"; - } - - - my @refk = keys (%$rref); - my @qryk = keys (%$qref); - my ($xrange, $yrange); - my ($xlabel, $ylabel); - my ($tic, $dir); - my $border = 0; - - #-- terminal header and output - print GFILE "set terminal $P_TERM\n"; - - if ( defined $OPT_Pfile ) { - print GFILE "set output \"$OPT_Pfile\"\n"; - } - - if ( defined $OPT_title ) { - print GFILE "set title \"$OPT_title\"\n"; - } - - #-- set tics, determine labels, ranges (ref) - if ( scalar (@refk) == 1 ) { - $xlabel = $refk[0]; - $xrange = $rref->{$xlabel}[1]; - } - else { - $xrange = 0; - print GFILE "set xtics rotate \( \\\n"; - foreach $xlabel ( sort { $rref->{$a}[0] <=> $rref->{$b}[0] } @refk ) { - $xrange += $rref->{$xlabel}[1]; - $tic = $rref->{$xlabel}[0] + 1; - $dir = ($rref->{$xlabel}[2] == 1) ? 
"" : "*"; - print GFILE " \"$dir$xlabel\" $tic, \\\n"; - } - print GFILE " \"\" $xrange \\\n\)\n"; - $xlabel = "REF"; - } - if ( $xrange == 0 ) { $xrange = "*"; } - - #-- set tics, determine labels, ranges (qry) - if ( $OPT_coverage ) { - $ylabel = "%SIM"; - $yrange = 110; - } - elsif ( scalar (@qryk) == 1 ) { - $ylabel = $qryk[0]; - $yrange = $qref->{$ylabel}[1]; - } - else { - $yrange = 0; - print GFILE "set ytics \( \\\n"; - foreach $ylabel ( sort { $qref->{$a}[0] <=> $qref->{$b}[0] } @qryk ) { - $yrange += $qref->{$ylabel}[1]; - $tic = $qref->{$ylabel}[0] + 1; - $dir = ($qref->{$ylabel}[2] == 1) ? "" : "*"; - print GFILE " \"$dir$ylabel\" $tic, \\\n"; - } - print GFILE " \"\" $yrange \\\n\)\n"; - $ylabel = "QRY"; - } - if ( $yrange == 0 ) { $yrange = "*"; } - - #-- determine borders - if ( $xrange ne "*" && scalar (@refk) == 1 ) { $border |= 10; } - if ( $yrange ne "*" && scalar (@qryk) == 1 ) { $border |= 5; } - if ( $OPT_coverage ) { $border |= 5; } - - #-- grid, labels, border - print GFILE - "$P_SIZE\n", - "set grid\n", - "$P_KEY\n", - "set border $border\n", - "set tics scale 0\n", - "set xlabel \"$xlabel\"\n", - "set ylabel \"$ylabel\"\n", - "$P_FORMAT\n"; - - #-- ranges - if ( defined $OPT_xrange ) { print GFILE "set xrange $OPT_xrange\n"; } - else { print GFILE "set xrange [1:$xrange]\n"; } - - if ( defined $OPT_yrange ) { print GFILE "set yrange $OPT_yrange\n"; } - else { print GFILE "set yrange [1:$yrange]\n"; } - - #-- if %sim plot - if ( $OPT_color ) { - print GFILE - "set zrange [0:100]\n", - "set colorbox default\n", - "set cblabel \"%similarity\"\n", - "set cbrange [0:100]\n", - "set cbtics 20\n", - "set pm3d map\n", - "set palette model RGB defined ( \\\n", - " 0 \"#000000\", \\\n", - " 4 \"#DD00DD\", \\\n", - " 6 \"#0000DD\", \\\n", - " 7 \"#00DDDD\", \\\n", - " 8 \"#00DD00\", \\\n", - " 9 \"#DDDD00\", \\\n", - " 10 \"#DD0000\" \\\n)\n"; - } - - foreach my $s ( ($FWD, $REV, $HLT) ) { - my $ss = "$P_LS $s "; - $ss .= $OPT_color ? 
" palette" : " lt $P_LT{$s}"; - $ss .= " lw $P_LW{$s}"; - if ( ! $OPT_coverage || $s == $HLT ) { - $ss .= " pt $P_PT{$s} ps $P_PS{$s}"; - } - print GFILE "$ss\n"; - } - - #-- plot it - print GFILE - ($OPT_color ? "splot \\\n" : "plot \\\n"); - print GFILE - " \"$OPT_Ffile\" title \"FWD\" $P_WITH ls $FWD, \\\n", - " \"$OPT_Rfile\" title \"REV\" $P_WITH ls $REV", - (! defined $OPT_Hfile ? "\n" : - ", \\\n \"$OPT_Hfile\" title \"HLT\" w lp ls $HLT"); - - #-- interactive mode - if ( $OPT_terminal eq $X11 ) { - print GFILE "\n", - "print \"-- INTERACTIVE MODE --\"\n", - "print \"consult gnuplot docs for command list\"\n", - "print \"mouse 1: coords to clipboard\"\n", - "print \"mouse 2: mark on plot\"\n", - "print \"mouse 3: zoom box\"\n", - "print \"'h' for help in plot window\"\n", - "print \"enter to exit\"\n", - "pause -1\n"; - } - - close (GFILE) - or print STDERR "WARNING: Trouble closing $OPT_Gfile, $!\n"; -} - - -#------------------------------------------------------------------- RunGP ----# -sub RunGP ( ) -{ - if ( defined $OPT_Pfile ) { - print STDERR "Rendering plot $OPT_Pfile\n"; - } - else { - print STDERR "Rendering plot to screen\n"; - } - - my $cmd = "gnuplot"; - - #-- x11 specifics - if ( $OPT_terminal eq $X11 ) { - my $size = $TERMSIZE{$OPT_terminal}{$OPT_size}; - $cmd .= " -geometry ${size}x"; - if ( $OPT_coverage ) { $size = sprintf ("%.0f", $size * .375); } - $cmd .= "${size}+0+0 -title mummerplot"; - - if ( defined $OPT_color && $OPT_color == 0 ) { - $cmd .= " -mono"; - $cmd .= " -xrm 'gnuplot*line1Dashes: 0'"; - $cmd .= " -xrm 'gnuplot*line2Dashes: 0'"; - $cmd .= " -xrm 'gnuplot*line3Dashes: 0'"; - } - - if ( $OPT_rv ) { - $cmd .= " -rv"; - $cmd .= " -xrm 'gnuplot*background: black'"; - $cmd .= " -xrm 'gnuplot*textColor: white'"; - $cmd .= " -xrm 'gnuplot*borderColor: white'"; - $cmd .= " -xrm 'gnuplot*axisColor: white'"; - } - } - - $cmd .= " $OPT_Gfile"; - - system ($cmd) - and print STDERR "WARNING: Unable to run '$cmd', $!\n"; -} - - 
-#---------------------------------------------------------------- ListenGP ----# -sub ListenGP($$) -{ - my $rref = shift; - my $qref = shift; - - my ($refc, $qryc); - my ($refid, $qryid); - my ($rsock, $qsock); - my $oldclip = ""; - - #-- get IDs sorted by offset - my @refo = sort { $rref->{$a}[0] <=> $rref->{$b}[0] } keys %$rref; - my @qryo = sort { $qref->{$a}[0] <=> $qref->{$b}[0] } keys %$qref; - - #-- attempt to connect sockets - if ( $OPT_rport ) { - $rsock = IO::Socket::INET->new("localhost:$OPT_rport") - or print STDERR "WARNING: Could not connect to rport $OPT_rport\n"; - } - - if ( $OPT_qport ) { - $qsock = IO::Socket::INET->new("localhost:$OPT_qport") - or print STDERR "WARNING: Could not connect to qport $OPT_qport\n"; - } - - #-- while parent still exists - while ( getppid != 1 ) { - - #-- query the clipboard - $_ = `xclip -o -silent -selection primary`; - if ( $? >> 8 ) { - die "WARNING: Unable to query clipboard with xclip\n"; - } - - #-- if cliboard has changed and contains a coordinate - if ( $_ ne $oldclip && (($refc, $qryc) = /^\[(\d+), (\d+)\]/) ) { - - $oldclip = $_; - - #-- translate the reference position - $refid = "NULL"; - for ( my $i = 0; $i < (scalar @refo); ++ $i ) { - my $aref = $rref->{$refo[$i]}; - if ( $i == $#refo || $aref->[0] + $aref->[1] > $refc ) { - $refid = $refo[$i]; - $refc -= $aref->[0]; - if ( $aref->[2] == -1 ) { - $refc = $aref->[1] - $refc + 1; - } - last; - } - } - - #-- translate the query position - $qryid = "NULL"; - for ( my $i = 0; $i < (scalar @qryo); ++ $i ) { - my $aref = $qref->{$qryo[$i]}; - if ( $i == $#qryo || $aref->[0] + $aref->[1] > $qryc ) { - $qryid = $qryo[$i]; - $qryc -= $aref->[0]; - if ( $aref->[2] == -1 ) { - $qryc = $aref->[1] - $qryc + 1; - } - last; - } - } - - #-- print the info to stdout and socket - print "$refid\t$qryid\t$refc\t$qryc\n"; - - if ( $rsock ) { - print $rsock "contig I$refid $refc\n"; - print "sent \"contig I$refid $refc\" to $OPT_rport\n"; - } - if ( $qsock ) { - print 
$qsock "contig I$qryid $qryc\n"; - print "sent \"contig I$qryid $qryc\" to $OPT_qport\n"; - } - } - - #-- sleep for half second - select undef, undef, undef, .5; - } - - exit (0); -} - - -#------------------------------------------------------------ ParseOptions ----# -sub ParseOptions ( ) -{ - my ($opt_small, $opt_medium, $opt_large); - my ($opt_ps, $opt_x11, $opt_png); - my $cnt; - - #-- Get options - my $err = $tigr -> TIGR_GetOptions - ( - "b|breaklen:i" => \$OPT_breaklen, - "color!" => \$OPT_color, - "c|coverage!" => \$OPT_coverage, - "f|filter!" => \$OPT_filter, - "l|layout!" => \$OPT_layout, - "p|prefix=s" => \$OPT_prefix, - "rv" => \$OPT_rv, - "r|IdR=s" => \$OPT_IdR, - "q|IdQ=s" => \$OPT_IdQ, - "R|Rfile=s" => \$OPT_IDRfile, - "Q|Qfile=s" => \$OPT_IDQfile, - "rport=i" => \$OPT_rport, - "qport=i" => \$OPT_qport, - "s|size=s" => \$OPT_size, - "S|SNP" => \$OPT_SNP, - "t|terminal=s" => \$OPT_terminal, - "title=s" => \$OPT_title, - "x|xrange=s" => \$OPT_xrange, - "y|yrange=s" => \$OPT_yrange, - "x11" => \$opt_x11, - "postscript" => \$opt_ps, - "png" => \$opt_png, - "small" => \$opt_small, - "medium" => \$opt_medium, - "large" => \$opt_large, - "fat" => \$OPT_ONLY_USE_FATTEST, - ); - - if ( !$err || scalar (@ARGV) != 1 ) { - $tigr -> printUsageInfo( ); - die "Try '$0 -h' for more information.\n"; - } - - $cnt = 0; - if ( $opt_png ) { $OPT_terminal = $PNG; $cnt ++; } - if ( $opt_ps ) { $OPT_terminal = $PS; $cnt ++; } - if ( $opt_x11 ) { $OPT_terminal = $X11; $cnt ++; } - if ( $cnt > 1 ) { - print STDERR - "WARNING: Multiple terminals not allowed, using '$OPT_terminal'\n"; - } - - $cnt = 0; - if ( $opt_large ) { $OPT_size = $LARGE; $cnt ++; } - if ( $opt_medium ) { $OPT_size = $MEDIUM; $cnt ++; } - if ( $opt_small ) { $OPT_size = $SMALL; $cnt ++; } - if ( $cnt > 1 ) { - print STDERR - "WARNING: Multiple sizes now allowed, using '$OPT_size'\n"; - } - - #-- Check that status of gnuplot - $OPT_gpstatus = system ("gnuplot --version"); - - if ( $OPT_gpstatus == -1 ) { - 
print STDERR - "WARNING: Could not find gnuplot, plot will not be rendered\n"; - } - elsif ( $OPT_gpstatus ) { - print STDERR - "WARNING: Using outdated gnuplot, use v4.0 for best results\n"; - - if ( $OPT_color ) { - print STDERR - "WARNING: Turning of --color option for compatibility\n"; - undef $OPT_color; - } - - if ( $OPT_terminal eq $PNG && $OPT_size ne $SMALL ) { - print STDERR - "WARNING: Turning of --size option for compatibility\n"; - $OPT_size = $SMALL; - } - } - - #-- Check options - if ( !exists $TERMSIZE{$OPT_terminal} ) { - die "ERROR: Invalid terminal type, $OPT_terminal\n"; - } - - if ( !exists $TERMSIZE{$OPT_terminal}{$OPT_size} ) { - die "ERROR: Invalid terminal size, $OPT_size\n"; - } - - if ( $OPT_xrange ) { - $OPT_xrange =~ tr/,/:/; - $OPT_xrange =~ /^\[\d+:\d+\]$/ - or die "ERROR: Invalid xrange format, $OPT_xrange\n"; - } - - if ( $OPT_yrange ) { - $OPT_yrange =~ tr/,/:/; - $OPT_yrange =~ /^\[\d+:\d+\]$/ - or die "ERROR: Invalid yrange format, $OPT_yrange\n"; - } - - #-- Set file names - $OPT_Mfile = $ARGV[0]; - $tigr->isReadableFile ($OPT_Mfile) - or die "ERROR: Could not read $OPT_Mfile, $!\n"; - - $OPT_Ffile = $OPT_prefix . $SUFFIX{$FWDPLOT}; - $tigr->isWritableFile ($OPT_Ffile) or $tigr->isCreatableFile ($OPT_Ffile) - or die "ERROR: Could not write $OPT_Ffile, $!\n"; - - $OPT_Rfile = $OPT_prefix . $SUFFIX{$REVPLOT}; - $tigr->isWritableFile ($OPT_Rfile) or $tigr->isCreatableFile ($OPT_Rfile) - or die "ERROR: Could not write $OPT_Rfile, $!\n"; - - if ( defined $OPT_breaklen || defined $OPT_SNP ) { - $OPT_Hfile = $OPT_prefix . $SUFFIX{$HLTPLOT}; - $tigr->isWritableFile($OPT_Hfile) or $tigr->isCreatableFile($OPT_Hfile) - or die "ERROR: Could not write $OPT_Hfile, $!\n"; - } - - if ($OPT_ONLY_USE_FATTEST) - { - $OPT_layout = 1; - } - - if ( $OPT_filter || $OPT_layout ) { - $OPT_Dfile = $OPT_prefix . 
$SUFFIX{$FILTER}; - $tigr->isWritableFile($OPT_Dfile) or $tigr->isCreatableFile($OPT_Dfile) - or die "ERROR: Could not write $OPT_Dfile, $!\n"; - } - - $OPT_Gfile = $OPT_prefix . $SUFFIX{$GNUPLOT}; - $tigr->isWritableFile ($OPT_Gfile) or $tigr->isCreatableFile ($OPT_Gfile) - or die "ERROR: Could not write $OPT_Gfile, $!\n"; - - if ( exists $SUFFIX{$OPT_terminal} ) { - $OPT_Pfile = $OPT_prefix . $SUFFIX{$OPT_terminal}; - $tigr->isWritableFile($OPT_Pfile) or $tigr->isCreatableFile($OPT_Pfile) - or die "ERROR: Could not write $OPT_Pfile, $!\n"; - } - - if ( defined $OPT_IDRfile ) { - $tigr->isReadableFile ($OPT_IDRfile) - or die "ERROR: Could not read $OPT_IDRfile, $!\n"; - } - - if ( defined $OPT_IDQfile ) { - $tigr->isReadableFile ($OPT_IDQfile) - or die "ERROR: Could not read $OPT_IDQfile, $!\n"; - } - - if ( (defined $OPT_rport || defined $OPT_qport) && - ($OPT_terminal ne $X11 || $OPT_gpstatus ) ) { - print STDERR - "WARNING: Port options available only for v4.0 X11 plots\n"; - undef $OPT_rport; - undef $OPT_qport; - } - - - if ( defined $OPT_color && defined $OPT_Hfile ) { - print STDERR - "WARNING: Turning off --color option so highlighting is visible\n"; - undef $OPT_color; - } -} diff --git a/tools/MUMmer3.23/nucmer b/tools/MUMmer3.23/nucmer deleted file mode 100755 index b396406..0000000 --- a/tools/MUMmer3.23/nucmer +++ /dev/null @@ -1,394 +0,0 @@ -#!/usr/bin/perl - -#------------------------------------------------------------------------------- -# Programmer: Adam M Phillippy, The Institute for Genomic Research -# File: nucmer -# Date: 04 / 09 / 03 -# -# Usage: -# nucmer [options] -# -# Try 'nucmer -h' for more information. -# -# Purpose: To create alignments between two multi-FASTA inputs by using -# the MUMmer matching and clustering algorithms. 
-# -#------------------------------------------------------------------------------- - -use lib "/export/home/zqhu/tools/MUMmer3.23/scripts"; -use Foundation; -use File::Spec::Functions; -use strict; - -my $AUX_BIN_DIR = "/export/home/zqhu/tools/MUMmer3.23/aux_bin"; -my $BIN_DIR = "/export/home/zqhu/tools/MUMmer3.23"; -my $SCRIPT_DIR = "/export/home/zqhu/tools/MUMmer3.23/scripts"; - - -my $VERSION_INFO = q~ -NUCmer (NUCleotide MUMmer) version 3.1 - ~; - - -my $HELP_INFO = q~ - USAGE: nucmer [options] - - DESCRIPTION: - nucmer generates nucleotide alignments between two mutli-FASTA input - files. The out.delta output file lists the distance between insertions - and deletions that produce maximal scoring alignments between each - sequence. The show-* utilities know how to read this format. - - MANDATORY: - Reference Set the input reference multi-FASTA filename - Query Set the input query multi-FASTA filename - - OPTIONS: - --mum Use anchor matches that are unique in both the reference - and query - --mumcand Same as --mumreference - --mumreference Use anchor matches that are unique in in the reference - but not necessarily unique in the query (default behavior) - --maxmatch Use all anchor matches regardless of their uniqueness - - -b|breaklen Set the distance an alignment extension will attempt to - extend poor scoring regions before giving up (default 200) - --[no]banded Enforce absolute banding of dynamic programming matrix - based on diagdiff parameter EXPERIMENTAL (default no) - -c|mincluster Sets the minimum length of a cluster of matches (default 65) - --[no]delta Toggle the creation of the delta file (default --delta) - --depend Print the dependency information and exit - -D|diagdiff Set the maximum diagonal difference between two adjacent - anchors in a cluster (default 5) - -d|diagfactor Set the maximum diagonal difference between two adjacent - anchors in a cluster as a differential fraction of the gap - length (default 0.12) - --[no]extend Toggle the 
cluster extension step (default --extend) - -f - --forward Use only the forward strand of the Query sequences - -g|maxgap Set the maximum gap between two adjacent matches in a - cluster (default 90) - -h - --help Display help information and exit - -l|minmatch Set the minimum length of a single match (default 20) - -o - --coords Automatically generate the original NUCmer1.1 coords - output file using the 'show-coords' program - --[no]optimize Toggle alignment score optimization, i.e. if an alignment - extension reaches the end of a sequence, it will backtrack - to optimize the alignment score instead of terminating the - alignment at the end of the sequence (default --optimize) - -p|prefix Set the prefix of the output files (default "out") - -r - --reverse Use only the reverse complement of the Query sequences - --[no]simplify Simplify alignments by removing shadowed clusters. Turn - this option off if aligning a sequence to itself to look - for repeats (default --simplify) - -V - --version Display the version information and exit - ~; - - -my $USAGE_INFO = q~ - USAGE: nucmer [options] - ~; - - -my @DEPEND_INFO = - ( - "$BIN_DIR/mummer", - "$BIN_DIR/mgaps", - "$BIN_DIR/show-coords", - "$AUX_BIN_DIR/postnuc", - "$AUX_BIN_DIR/prenuc", - "$SCRIPT_DIR/Foundation.pm" - ); - - -my %DEFAULT_PARAMETERS = - ( - "OUTPUT_PREFIX" => "out", # prefix for all output files - "MATCH_ALGORITHM" => "-mumreference", # match finding algo switch - "MATCH_DIRECTION" => "-b", # match direction switch - "MIN_MATCH" => "20", # minimum match size - "MAX_GAP" => "90", # maximum gap between matches - "MIN_CLUSTER" => "65", # minimum cluster size - "DIAG_DIFF" => "5", # diagonal difference absolute - "DIAG_FACTOR" => ".12", # diagonal difference fraction - "BREAK_LEN" => "200", # extension break length - "POST_SWITCHES" => "" # switches for the post processing - ); - - -sub main ( ) -{ - my $tigr; # TIGR::Foundation object - my @err; # Error variable - - my $ref_file; # path of the reference 
input file - my $qry_file; # path of the query input file - - #-- The command line options for the various programs - my $pfx = $DEFAULT_PARAMETERS { "OUTPUT_PREFIX" }; - my $algo = $DEFAULT_PARAMETERS { "MATCH_ALGORITHM" }; - my $mdir = $DEFAULT_PARAMETERS { "MATCH_DIRECTION" }; - my $size = $DEFAULT_PARAMETERS { "MIN_MATCH" }; - my $gap = $DEFAULT_PARAMETERS { "MAX_GAP" }; - my $clus = $DEFAULT_PARAMETERS { "MIN_CLUSTER" }; - my $ddiff = $DEFAULT_PARAMETERS { "DIAG_DIFF" }; - my $dfrac = $DEFAULT_PARAMETERS { "DIAG_FACTOR" }; - my $blen = $DEFAULT_PARAMETERS { "BREAK_LEN" }; - my $psw = $DEFAULT_PARAMETERS { "POST_SWITCHES" }; - - my $fwd; # if true, use forward strand - my $rev; # if true, use reverse strand - my $maxmatch; # matching algorithm switches - my $mumreference; - my $mum; - my $banded = 0; # if true, enforce absolute dp banding - my $extend = 1; # if true, extend clusters - my $delta = 1; # if true, create the delta file - my $optimize = 1; # if true, optimize alignment scores - my $simplify = 1; # if true, simplify shadowed alignments - - my $generate_coords; - - #-- Initialize TIGR::Foundation - $tigr = new TIGR::Foundation; - if ( !defined ($tigr) ) { - print (STDERR "ERROR: TIGR::Foundation could not be initialized"); - exit (1); - } - - #-- Set help and usage information - $tigr->setHelpInfo ($HELP_INFO); - $tigr->setUsageInfo ($USAGE_INFO); - $tigr->setVersionInfo ($VERSION_INFO); - $tigr->addDependInfo (@DEPEND_INFO); - - #-- Get command line parameters - $err[0] = $tigr->TIGR_GetOptions - ( - "maxmatch" => \$maxmatch, - "mumcand" => \$mumreference, - "mumreference" => \$mumreference, - "mum" => \$mum, - "b|breaklen=i" => \$blen, - "banded!" => \$banded, - "c|mincluster=i" => \$clus, - "delta!" => \$delta, - "D|diagdiff=i" => \$ddiff, - "d|diagfactor=f" => \$dfrac, - "extend!" => \$extend, - "f|forward" => \$fwd, - "g|maxgap=i" => \$gap, - "l|minmatch=i" => \$size, - "o|coords" => \$generate_coords, - "optimize!" 
=> \$optimize, - "p|prefix=s" => \$pfx, - "r|reverse" => \$rev, - "simplify!" => \$simplify - ); - - - #-- Check if the parsing was successful - if ( $err[0] == 0 || $#ARGV != 1 ) { - $tigr->printUsageInfo( ); - print (STDERR "Try '$0 -h' for more information.\n"); - exit (1); - } - - $ref_file = File::Spec->rel2abs ($ARGV[0]); - $qry_file = File::Spec->rel2abs ($ARGV[1]); - - #-- Set up the program parameters - if ( $fwd && $rev ) { - $mdir = "-b"; - } elsif ( $fwd ) { - $mdir = ""; - } elsif ( $rev ) { - $mdir = "-r"; - } - if ( ! $extend ) { - $psw .= "-e "; - } - if ( ! $delta ) { - $psw .= "-d "; - } - if ( ! $optimize ) { - $psw .= "-t "; - } - if ( ! $simplify ) { - $psw .= "-s "; - } - - undef (@err); - $err[0] = 0; - if ( $mum ) { - $err[0] ++; - $algo = "-mum"; - } - if ( $mumreference ) { - $err[0] ++; - $algo = "-mumreference"; - } - if ( $maxmatch ) { - $err[0] ++; - $algo = "-maxmatch"; - } - if ( $err[0] > 1 ) { - $tigr->printUsageInfo( ); - print (STDERR "ERROR: Multiple matching algorithms selected\n"); - print (STDERR "Try '$0 -h' for more information.\n"); - exit (1); - } - - #-- Set up the program path names - my $algo_path = "$BIN_DIR/mummer"; - my $mgaps_path = "$BIN_DIR/mgaps"; - my $prenuc_path = "$AUX_BIN_DIR/prenuc"; - my $postnuc_path = "$AUX_BIN_DIR/postnuc"; - my $showcoords_path = "$BIN_DIR/show-coords"; - - #-- Check that the files needed are all there and readable/writable - { - undef (@err); - if ( !$tigr->isExecutableFile ($algo_path) ) { - push (@err, $algo_path); - } - - if ( !$tigr->isExecutableFile ($mgaps_path) ) { - push (@err, $mgaps_path); - } - - if ( !$tigr->isExecutableFile ($prenuc_path) ) { - push (@err, $prenuc_path); - } - - if ( !$tigr->isExecutableFile ($postnuc_path) ) { - push (@err, $postnuc_path); - } - - if ( !$tigr->isReadableFile ($ref_file) ) { - push (@err, $ref_file); - } - - if ( !$tigr->isReadableFile ($qry_file) ) { - push (@err, $qry_file); - } - - if ( !$tigr->isCreatableFile ("$pfx.ntref") ) { - if 
( !$tigr->isWritableFile ("$pfx.ntref") ) { - push (@err, "$pfx.ntref"); - } - } - - if ( !$tigr->isCreatableFile ("$pfx.mgaps") ) { - if ( !$tigr->isWritableFile ("$pfx.mgaps") ) { - push (@err, "$pfx.mgaps"); - } - } - - if ( !$tigr->isCreatableFile ("$pfx.delta") ) { - if ( !$tigr->isWritableFile ("$pfx.delta") ) { - push (@err, "$pfx.delta"); - } - } - - if ( $generate_coords ) { - if ( !$tigr->isExecutableFile ($showcoords_path) ) { - push (@err, $showcoords_path); - } - if ( !$tigr->isCreatableFile ("$pfx.coords") ) { - if ( !$tigr->isWritableFile ("$pfx.coords") ) { - push (@err, "$pfx.coords"); - } - } - } - - #-- If 1 or more files could not be processed, terminate script - if ( $#err >= 0 ) { - $tigr->logError - ("ERROR: The following critical files could not be used", 1); - while ( $#err >= 0 ) { - $tigr->logError (pop(@err), 1); - } - $tigr->logError - ("Check your paths and file permissions and try again", 1); - $tigr->bail( ); - } - } - - - #-- Run prenuc and assert return value is zero - print (STDERR "1: PREPARING DATA\n"); - $err[0] = $tigr->runCommand - ("$prenuc_path $ref_file > $pfx.ntref"); - - if ( $err[0] != 0 ) { - $tigr->bail - ("ERROR: prenuc returned non-zero\n"); - } - - - #-- Run mummer | mgaps and assert return value is zero - print (STDERR "2,3: RUNNING mummer AND CREATING CLUSTERS\n"); - open(ALGO_PIPE, "$algo_path $algo $mdir -l $size -n $pfx.ntref $qry_file |") - or $tigr->bail ("ERROR: could not open $algo_path output pipe $!"); - open(CLUS_PIPE, "| $mgaps_path -l $clus -s $gap -d $ddiff -f $dfrac > $pfx.mgaps") - or $tigr->bail ("ERROR: could not open $mgaps_path input pipe $!"); - while ( ) { - print CLUS_PIPE - or $tigr->bail ("ERROR: could not write to $mgaps_path pipe $!"); - } - $err[0] = close(ALGO_PIPE); - $err[1] = close(CLUS_PIPE); - - if ( $err[0] == 0 || $err[1] == 0 ) { - $tigr->bail ("ERROR: mummer and/or mgaps returned non-zero\n"); - } - - - #-- Run postnuc and assert return value is zero - print (STDERR "4: 
FINISHING DATA\n"); - if ( $banded ) - { - $err[0] = $tigr->runCommand - ("$postnuc_path $psw -b $blen -B $ddiff $ref_file $qry_file $pfx < $pfx.mgaps"); - } - else - { - $err[0] = $tigr->runCommand - ("$postnuc_path $psw -b $blen $ref_file $qry_file $pfx < $pfx.mgaps"); - } - - if ( $err[0] != 0 ) { - $tigr->bail ("ERROR: postnuc returned non-zero\n"); - } - - #-- If the -o flag was set, run show-coords using NUCmer1.1 settings - if ( $generate_coords ) { - print (STDERR "5: GENERATING COORDS FILE\n"); - $err[0] = $tigr->runCommand - ("$showcoords_path -r $pfx.delta > $pfx.coords"); - - if ( $err[0] != 0 ) { - $tigr->bail ("ERROR: show-coords returned non-zero\n"); - } - } - - #-- Remove the temporary output - $err[0] = unlink ("$pfx.ntref", "$pfx.mgaps"); - - if ( $err[0] != 2 ) { - $tigr->logError ("WARNING: there was a problem deleting". - " the temporary output files", 1); - } - - #-- Return success - return (0); -} - -exit ( main ( ) ); - -#-- END OF SCRIPT diff --git a/tools/MUMmer3.23/nucmer2xfig b/tools/MUMmer3.23/nucmer2xfig deleted file mode 100755 index 7ec16ae..0000000 --- a/tools/MUMmer3.23/nucmer2xfig +++ /dev/null @@ -1,139 +0,0 @@ -#!/usr/bin/perl -# (c) Steven Salzberg 2001 -# Make an xfig plot for a comparison of a reference chromosome (or single -# molecule) versus a multifasta file of contigs from another genome. -# The input file here is a NUCmer coords file such as: -# 2551 2577 | 240 266 | 27 27 | 96.30 | 20302755 1424 | 0.00 1.90 | 2R 1972084 -# generated by running 'show-coords -c -l ' -# For the above example, D. melanogaster chr 2R is the reference and the query -# is an assembly with 1000s of contigs from D. pseudoobscura -# The file needs to be sorted by the smaller contig ids, via: -# tail +6 Dmel2R-vs-Dpseudo.coords | sort -k 19n -k 4n > Dmel2R-vs-Dpseudo-resort.coords -# and remember to get rid of top 5 (header) lines. 
-# Usage: plot-drosoph-align-xfig.perl Dmel2R-vs-Dpseudo-resort.coords -unless (open(coordsfile,$ARGV[0])) { - die ("can't open file $ARGV[0].\n"); -} -$Xscale = 0.005; -$Yscale = 20; -$chrcolor = 4; # 4 is red, 1 is blue, 2 is green -$green = 2; -$contigcolor = 1; # contigs are blue - -# print header info -print "#FIG 3.2\nLandscape\nCenter\nInches\nLetter \n100.00\nSingle\n-2\n1200 2\n"; -$first_time = 1; -while (<coordsfile>) { - ($s1,$e1,$x1,$s2,$e2,$x2,$l1,$l2,$x3,$percent_id,$x4,$Rlen,$Qlen,$x5,$Rcov,$Qcov,$x6,$Rid,$Qid) = split(" "); - if ($prevQid eq $Qid) { # query contig is same as prev line - $dist = abs($s1 - $prev_s1); - if ( $dist > 2 * $Qlen) { # if this match is too far away - # print the contig here if the matching bit is > 1000 - if ($right_end{$Qid} - $left_end{$Qid} > 1000) { - # print it at y=50 rather than 100 because the scale is 0-50, - # where 0=50% identical and 50 is 100% identical - print_xfig_line($left_end{$Qid},50,$right_end{$Qid},50,$contigcolor); - print_label($left_end{$Qid},50,$right_end{$Qid},$Qid,5); - } - $left_end{$Qid} = $s1; # then re-set the start and end of the contig - $right_end{$Qid} = $e1; # and we'll print it again later - } - else { # extend the boundaries of the match - if ($s1 < $left_end{$Qid}) { $left_end{$Qid} = $s1; } - if ($e1 > $right_end{$Qid}) { $right_end{$Qid} = $e1; } - } - } - else { # this is a different contig, first time seeing it - $left_end{$Qid} = $s1; - $right_end{$Qid} = $e1; - # print the previous contig as a line at y=100 for 100% - if ($first_time < 1) { - print_xfig_line($left_end{$prevQid},50,$right_end{$prevQid},50,$contigcolor); - print_label($left_end{$prevQid},50,$right_end{$prevQid},$prevQid,5); - } - else { $first_time = 0; } - } - $prevQid = $Qid; - $prev_s1 = $s1; - $prev_Qlen = $Qlen; - # next print the matching bit as a separate line, with a separate color, - # with its height determined by percent match - $Xleft = int($Xscale * $s1); - $Xright = int($Xscale * $e1); - if ($Xleft == $Xright) { 
$Xright += 1; } - if ($percent_id < 50) { $percent_id = 50; } - print_xfig_line($s1,$percent_id - 50,$e1,$percent_id - 50,$green); -} -# print very last contig -$left_end{$Qid} = $s1; -$right_end{$Qid} = $e1; -print_xfig_line($s1,50,$e1,50,$contigcolor); -print_label($s1,50,$e1,$Qid,5); -close(coordsfile); - -# now draw the horizontal chr line for the reference -$label_xpos = $Xscale * $Rlen; -print "4 0 0 100 0 0 12 0.0000 4 135 405 "; -printf("%.0f %.0f",$label_xpos, 0); -print " ", $Rid, "\\001\n"; -print "2 1 0 2 $chrcolor 7 100 0 -1 0.000 0 0 -1 0 0 2\n"; -printf("\t %.0f %.0f %.0f %.0f\n", 0, 0, - $Xscale * $Rlen, 0); -# print some X-axis coordinates -$pointsize = 5; -for ($i = 250000; $i < $Rlen - 50000; $i+= 250000) { - print "4 0 0 100 0 0 $pointsize 0.0000 4 135 405 "; - printf("%.0f %.0f",$i * $Xscale + 20, -20 + ($Yscale * -0.5)); - print " $i", "\\001\n"; - #print a vertical tic mark - print "2 1 0 1 0 7 50 0 -1 0.000 0 0 -1 0 0 2\n"; - printf("%.0f %.0f %.0f -50\n", - $i * $Xscale, 0, $i * $Xscale); -} -# print tic marks indicating % identity on the y-axis -for ($percent_id = 50; $percent_id < 101; $percent_id += 10) { - print "4 0 0 100 0 0 $pointsize 0.0000 4 135 405 "; - printf("-150 %.0f", ($percent_id - 50) * $Yscale + 10); # shift down 10 pixels - print " $percent_id", "\\001\n"; - # print the tic mark - print "2 1 0 1 0 7 50 0 -1 0.000 0 0 -1 0 0 2\n"; - printf("-50 %.0f 0 %.0f\n", - ($percent_id - 50) * $Yscale, ($percent_id - 50) * $Yscale); -} - -# print a line in the appropriate color, scaled with Xscale,Yscale -sub print_xfig_line { - my ($xleft,$yleft,$xright,$yright,$color) = @_; - # print it at the given coordinates, scaled - $xleft_scaled = int($Xscale * $xleft); - $xright_scaled = int($Xscale * $xright); - # Xfig has a bug: if we re-scale and the resulting X coordinates are equal, - # then it will print a fixed-size (large) rectangle, rather than a 1-pixel - # wide one. So check fo this and correct. 
- if ($xleft_scaled == $xright_scaled) { $xright_scaled += 1; } - # set up and print line in xfig format - print "2 1 0 2 $color 7 100 0 -1 0.000 0 0 -1 0 0 2\n"; - printf("\t %.0f %.0f %.0f %.0f\n", - $xleft_scaled, $Yscale * $yleft, $xright_scaled, $Yscale * $yright); -} - -# print a label for each contig using its ID -sub print_label { - my ($xleft,$yleft,$xright,$id,$psize) = @_; - # print it at the left edge of the contig. The angle - # of 4.7124 means text goes down vertically. 5th argument - # here is pointsize of the label. - # bump the label down 20 pixels - $yposition = $yleft * $Yscale + 20; - # to keep the display clean, don't print the label unless the - # contig itself is wider than the width of the text in the label - $xleft_scaled = $xleft * $Xscale; - $xright_scaled = $xright * $Xscale; - # each point of type is 8 pixels - $textheight = $psize * 8; - if ($xright_scaled - $xleft_scaled > $textheight) { - print "4 0 0 50 0 0 $psize 4.7124 4 135 435 "; - printf("%.0f %.0f",$xleft_scaled, $yposition); - print " $id", "\\001\n"; - } -} diff --git a/tools/MUMmer3.23/promer b/tools/MUMmer3.23/promer deleted file mode 100755 index d317dd3..0000000 --- a/tools/MUMmer3.23/promer +++ /dev/null @@ -1,382 +0,0 @@ -#!/usr/bin/perl - -#------------------------------------------------------------------------------- -# Programmer: Adam M Phillippy, The Institute for Genomic Research -# File: promer -# Date: 04 / 09 / 03 -# -# Usage: -# promer [options] -# -# Try 'promer -h' for more information. -# -# Purpose: To create alignments between two multi-FASTA inputs by using -# the MUMmer matching and clustering algorithms. 
-# -#------------------------------------------------------------------------------- - -use lib "/export/home/zqhu/tools/MUMmer3.23/scripts"; -use Foundation; -use File::Spec::Functions; -use strict; - -my $AUX_BIN_DIR = "/export/home/zqhu/tools/MUMmer3.23/aux_bin"; -my $BIN_DIR = "/export/home/zqhu/tools/MUMmer3.23"; -my $SCRIPT_DIR = "/export/home/zqhu/tools/MUMmer3.23/scripts"; - - - -my $VERSION_INFO = q~ -PROmer (PROtein MUMmer) version 3.07 - ~; - - - -my $HELP_INFO = q~ - USAGE: promer [options] - - DESCRIPTION: - promer generates amino acid alignments between two mutli-FASTA DNA input - files. The out.delta output file lists the distance between insertions - and deletions that produce maximal scoring alignments between each - sequence. The show-* utilities know how to read this format. The DNA - input is translated into all 6 reading frames in order to generate the - output, but the output coordinates reference the original DNA input. - - MANDATORY: - Reference Set the input reference multi-FASTA DNA file - Query Set the input query multi-FASTA DNA file - - OPTIONS: - --mum Use anchor matches that are unique in both the reference - and query - --mumcand Same as --mumreference - --mumreference Use anchor matches that are unique in in the reference - but not necessarily unique in the query (default behavior) - --maxmatch Use all anchor matches regardless of their uniqueness - - -b|breaklen Set the distance an alignment extension will attempt to - extend poor scoring regions before giving up, measured in - amino acids (default 60) - -c|mincluster Sets the minimum length of a cluster of matches, measured in - amino acids (default 20) - --[no]delta Toggle the creation of the delta file (default --delta) - --depend Print the dependency information and exit - -d|diagfactor Set the clustering diagonal difference separation factor - (default .11) - --[no]extend Toggle the cluster extension step (default --extend) - -g|maxgap Set the maximum gap between two adjacent 
matches in a - cluster, measured in amino acids (default 30) - -h - --help Display help information and exit. - -l|minmatch Set the minimum length of a single match, measured in amino - acids (default 6) - -m|masklen Set the maximum bookend masking lenth, measured in amino - acids (default 8) - -o - --coords Automatically generate the original PROmer1.1 ".coords" - output file using the "show-coords" program - --[no]optimize Toggle alignment score optimization, i.e. if an alignment - extension reaches the end of a sequence, it will backtrack - to optimize the alignment score instead of terminating the - alignment at the end of the sequence (default --optimize) - - -p|prefix Set the prefix of the output files (default "out") - -V - --version Display the version information and exit - -x|matrix Set the alignment matrix number to 1 [BLOSUM 45], 2 [BLOSUM - 62] or 3 [BLOSUM 80] (default 2) - ~; - - -my $USAGE_INFO = q~ - USAGE: promer [options] - ~; - - -my @DEPEND_INFO = - ( - "$BIN_DIR/mummer", - "$BIN_DIR/mgaps", - "$BIN_DIR/show-coords", - "$AUX_BIN_DIR/postpro", - "$AUX_BIN_DIR/prepro", - "$SCRIPT_DIR/Foundation.pm" - ); - - -my %DEFAULT_PARAMETERS = - ( - "OUTPUT_PREFIX" => "out", # prefix for all output files - "MATCH_ALGORITHM" => "-mumreference", # match finding algo switch - "MIN_MATCH" => "6", # minimum match size (aminos) - "MAX_GAP" => "30", # maximum gap between matches (aminos) - "MIN_CLUSTER" => "20", # minimum cluster size (aminos) - "DIAG_FACTOR" => ".11", # diagonal difference fraction - "BREAK_LEN" => "60", # extension break length - "BLOSUM_NUMBER" => "2", # options are 1,2,3 (BLOSUM 45,62,80) - "MASKING_LENGTH" => "8", # set bookend masking length - "POST_SWITCHES" => "" # switches for the post processing - ); - - -sub main ( ) -{ - my $tigr; # TIGR::Foundation object - my @err; # Error variable - - my $ref_file; # path of the reference input file - my $qry_file; # path of the query input file - - #-- The command line options for the various 
programs - my $pfx = $DEFAULT_PARAMETERS { "OUTPUT_PREFIX" }; - my $algo = $DEFAULT_PARAMETERS { "MATCH_ALGORITHM" }; - my $size = $DEFAULT_PARAMETERS { "MIN_MATCH" }; - my $gap = $DEFAULT_PARAMETERS { "MAX_GAP" }; - my $clus = $DEFAULT_PARAMETERS { "MIN_CLUSTER" }; - my $diff = $DEFAULT_PARAMETERS { "DIAG_FACTOR" }; - my $blen = $DEFAULT_PARAMETERS { "BREAK_LEN" }; - my $blsm = $DEFAULT_PARAMETERS { "BLOSUM_NUMBER" }; - my $mask = $DEFAULT_PARAMETERS { "MASKING_LENGTH" }; - my $psw = $DEFAULT_PARAMETERS { "POST_SWITCHES" }; - - my $maxmatch; # matching algorithm switches - my $mumreference; - my $mum; - my $extend = 1; # if true, extend clusters - my $delta = 1; # if true, create the delta file - my $optimize = 1; # if true, optimize alignment scores - - my $generate_coords; - - #-- Initialize TIGR::Foundation - $tigr = new TIGR::Foundation; - if ( !defined ($tigr) ) { - print (STDERR "ERROR: TIGR::Foundation could not be initialized"); - exit (1); - } - - #-- Set help and usage information - $tigr->setHelpInfo ($HELP_INFO); - $tigr->setUsageInfo ($USAGE_INFO); - $tigr->setVersionInfo ($VERSION_INFO); - $tigr->addDependInfo (@DEPEND_INFO); - - #-- Get command line parameters - $err[0] = $tigr->TIGR_GetOptions - ( - "maxmatch" => \$maxmatch, - "mumcand" => \$mumreference, - "mumreference" => \$mumreference, - "mum" => \$mum, - "b|breaklen=i" => \$blen, - "c|mincluster=i" => \$clus, - "delta!" => \$delta, - "d|diagfactor=f" => \$diff, - "extend!" => \$extend, - "g|maxgap=i" => \$gap, - "l|minmatch=i" => \$size, - "m|masklen=i" => \$mask, - "o|coords" => \$generate_coords, - "optimize!" 
=> \$optimize, - "p|prefix=s" => \$pfx, - "x|matrix=i" => \$blsm - ); - - #-- Check if the parsing was successful - if ( $err[0] == 0 || $#ARGV != 1 ) { - $tigr->printUsageInfo( ); - print (STDERR "Try '$0 -h' for more information.\n"); - exit (1); - } - - $ref_file = File::Spec->rel2abs ($ARGV[0]); - $qry_file = File::Spec->rel2abs ($ARGV[1]); - - #-- Set up the program parameters - if ( ! $extend ) { - $psw .= "-e "; - } - if ( ! $delta ) { - $psw .= "-d "; - } - if ( ! $optimize ) { - $psw .= "-t "; - } - - undef (@err); - $err[0] = 0; - if ( $mum ) { - $err[0] ++; - $algo = "-mum"; - } - if ( $mumreference ) { - $err[0] ++; - $algo = "-mumreference"; - } - if ( $maxmatch ) { - $err[0] ++; - $algo = "-maxmatch"; - } - if ( $err[0] > 1 ) { - $tigr->printUsageInfo( ); - print (STDERR "ERROR: Multiple matching algorithms selected\n"); - print (STDERR "Try '$0 -h' for more information.\n"); - exit (1); - } - - #-- Set up the program path names - my $algo_path = "$BIN_DIR/mummer"; - my $mgaps_path = "$BIN_DIR/mgaps"; - my $prepro_path = "$AUX_BIN_DIR/prepro"; - my $postpro_path = "$AUX_BIN_DIR/postpro"; - my $showcoords_path = "$BIN_DIR/show-coords"; - - #-- Check that the files needed are all there and readable/writable - { - undef (@err); - if ( !$tigr->isExecutableFile ($algo_path) ) { - push (@err, $algo_path); - } - - if ( !$tigr->isExecutableFile ($mgaps_path) ) { - push (@err, $mgaps_path); - } - - if ( !$tigr->isExecutableFile ($prepro_path) ) { - push (@err, $prepro_path); - } - - if ( !$tigr->isExecutableFile ($postpro_path) ) { - push (@err, $postpro_path); - } - - if ( !$tigr->isReadableFile ($ref_file) ) { - push (@err, $ref_file); - } - - if ( !$tigr->isReadableFile ($qry_file) ) { - push (@err, $qry_file); - } - - if ( !$tigr->isCreatableFile ("$pfx.aaref") ) { - if ( !$tigr->isWritableFile ("$pfx.aaref") ) { - push (@err, "$pfx.aaref"); - } - } - - if ( !$tigr->isCreatableFile ("$pfx.aaqry") ) { - if ( !$tigr->isWritableFile ("$pfx.aaqry") ) { - push 
(@err, "$pfx.aaqry"); - } - } - - if ( !$tigr->isCreatableFile ("$pfx.mgaps") ) { - if ( !$tigr->isWritableFile ("$pfx.mgaps") ) { - push (@err, "$pfx.mgaps"); - } - } - - if ( !$tigr->isCreatableFile ("$pfx.delta") ) { - if ( !$tigr->isWritableFile ("$pfx.delta") ) { - push (@err, "$pfx.delta"); - } - } - - if ( $generate_coords ) { - if ( !$tigr->isExecutableFile ($showcoords_path) ) { - push (@err, $showcoords_path); - } - if ( !$tigr->isCreatableFile ("$pfx.coords") ) { - if ( !$tigr->isWritableFile ("$pfx.coords") ) { - push (@err, "$pfx.coords"); - } - } - } - - #-- If 1 or more files could not be processed, terminate script - if ( $#err >= 0 ) { - $tigr->logError - ("ERROR: The following critical files could not be used", 1); - while ( $#err >= 0 ) { - $tigr->logError (pop(@err), 1); - } - $tigr->logError - ("Check your paths and file permissions and try again", 1); - $tigr->bail( ); - } - } - - - #-- Run prepro -r and -q and assert return value is zero - print (STDERR "1: PREPARING DATA\n"); - $err[0] = $tigr->runCommand - ("$prepro_path -m $mask -r $ref_file > $pfx.aaref"); - - if ( $err[0] != 0 ) { - $tigr->bail - ("ERROR: prepro -r returned non-zero\n"); - } - - $err[0] = $tigr->runCommand - ("$prepro_path -m $mask -q $qry_file > $pfx.aaqry"); - - if ( $err[0] != 0 ) { - $tigr->bail ("ERROR: prepro -q returned non-zero\n"); - } - - - #-- Run mummer | mgaps and assert return value is zero - print (STDERR "2,3: RUNNING mummer AND CREATING CLUSTERS\n"); - open(ALGO_PIPE, "$algo_path $algo -l $size $pfx.aaref $pfx.aaqry |") - or $tigr->bail ("ERROR: could not open $algo_path output pipe $!"); - open(CLUS_PIPE, "| $mgaps_path -l $clus -s $gap -f $diff > $pfx.mgaps") - or $tigr->bail ("ERROR: could not open $mgaps_path input pipe $!"); - while ( <ALGO_PIPE> ) { - print CLUS_PIPE - or $tigr->bail ("ERROR: could not write to $mgaps_path pipe $!"); - } - $err[0] = close(ALGO_PIPE); - $err[1] = close(CLUS_PIPE); - - if ( $err[0] == 0 || $err[1] == 0 ) { - $tigr->bail ("ERROR: 
mummer and/or mgaps returned non-zero\n"); - } - - - #-- Run postpro and assert return value is zero - print (STDERR "4: FINISHING DATA\n"); - $err[0] = $tigr->runCommand - ("$postpro_path $psw -x $blsm -b $blen ". - "$ref_file $qry_file $pfx < $pfx.mgaps"); - - if ( $err[0] != 0 ) { - $tigr->bail ("ERROR: postpro returned non-zero\n"); - } - - #-- If the -o flag was set, run show-coords using PROmer1.1 settings - if ( $generate_coords ) { - print (STDERR "5: GENERATING COORDS FILE\n"); - $err[0] = $tigr->runCommand - ("$showcoords_path -r $pfx.delta > $pfx.coords"); - - if ( $err[0] != 0 ) { - $tigr->bail ("ERROR: show-coords returned non-zero\n"); - } - } - - #-- Remove the temporary output - $err[0] = unlink ("$pfx.aaref", "$pfx.aaqry", "$pfx.mgaps"); - - if ( $err[0] != 3 ) { - $tigr->logError ("WARNING: there was a problem deleting". - " the temporary output files", 1); - } - - #-- Return success - return (0); -} - -exit ( main ( ) ); - -#-- END OF SCRIPT diff --git a/tools/MUMmer3.23/repeat-match b/tools/MUMmer3.23/repeat-match deleted file mode 100755 index 9316a6e..0000000 Binary files a/tools/MUMmer3.23/repeat-match and /dev/null differ diff --git a/tools/MUMmer3.23/run-mummer1 b/tools/MUMmer3.23/run-mummer1 deleted file mode 100755 index 084a5e5..0000000 --- a/tools/MUMmer3.23/run-mummer1 +++ /dev/null @@ -1,26 +0,0 @@ -#!/bin/csh -f -# -# **SEVERELY** antiquated script for running the mummer 1 suite -# -r option reverse complements the query sequence, coordinates of the reverse -# matches will be relative to the reversed sequence -# - -set ref = $1 -set qry = $2 -set pfx = $3 -set rev = $4 - -set bindir = /export/home/zqhu/tools/MUMmer3.23 - -if($ref == '' || $qry == '' || $pfx == '') then - echo "USAGE: $0 [-r]" - exit(-1) -endif - -echo "Find MUMs" -$bindir/mummer -mum -l 20 $rev $ref $qry | tail +2 > $pfx.out -echo "Determine gaps" -$bindir/gaps $ref $rev < $pfx.out > $pfx.gaps -echo "Align gaps" -$bindir/annotate $pfx.gaps $qry > $pfx.align -mv 
witherrors.gaps $pfx.errorsgaps diff --git a/tools/MUMmer3.23/run-mummer3 b/tools/MUMmer3.23/run-mummer3 deleted file mode 100755 index a80ce95..0000000 --- a/tools/MUMmer3.23/run-mummer3 +++ /dev/null @@ -1,28 +0,0 @@ -#!/bin/csh -f -# -# for running the basic mummer 3 suite, should use nucmer instead when possible -# to avoid the confusing reverse coordinate system of the raw programs. -# -# NOTE: be warned that all reverse matches will then -# be relative to the reverse complement of the query sequence. -# -# Edit this script as necessary to alter the matching and clustering values -# - -set ref = $1 -set qry = $2 -set pfx = $3 - -set bindir = /export/home/zqhu/tools/MUMmer3.23 - -if($ref == '' || $qry == '' || $pfx == '') then - echo "USAGE: $0 " - exit(-1) -endif - -echo "Find MUMs" -$bindir/mummer -mumreference -b -l 20 $ref $qry > $pfx.out -echo "Determine gaps" -$bindir/mgaps -l 100 -f .12 -s 600 < $pfx.out > $pfx.gaps -echo "Align gaps" -$bindir/combineMUMs -x -e .10 -W $pfx.errorsgaps $ref $qry $pfx.gaps > $pfx.align diff --git a/tools/MUMmer3.23/show-aligns b/tools/MUMmer3.23/show-aligns deleted file mode 100755 index fbb9c3d..0000000 Binary files a/tools/MUMmer3.23/show-aligns and /dev/null differ diff --git a/tools/MUMmer3.23/show-coords b/tools/MUMmer3.23/show-coords deleted file mode 100755 index 018de94..0000000 Binary files a/tools/MUMmer3.23/show-coords and /dev/null differ diff --git a/tools/MUMmer3.23/show-diff b/tools/MUMmer3.23/show-diff deleted file mode 100755 index cbb6632..0000000 Binary files a/tools/MUMmer3.23/show-diff and /dev/null differ diff --git a/tools/MUMmer3.23/show-snps b/tools/MUMmer3.23/show-snps deleted file mode 100755 index db1d725..0000000 Binary files a/tools/MUMmer3.23/show-snps and /dev/null differ diff --git a/tools/MUMmer3.23/show-tiling b/tools/MUMmer3.23/show-tiling deleted file mode 100755 index f87db3f..0000000 Binary files a/tools/MUMmer3.23/show-tiling and /dev/null differ diff --git 
a/tools/MUMmer3.23/src/kurtz/libbasedir/cleanMUMcand.o b/tools/MUMmer3.23/src/kurtz/libbasedir/cleanMUMcand.o deleted file mode 100644 index 1addc7a..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/libbasedir/cleanMUMcand.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/libbasedir/clock.o b/tools/MUMmer3.23/src/kurtz/libbasedir/clock.o deleted file mode 100644 index 0c6e980..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/libbasedir/clock.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/libbasedir/libbase.a b/tools/MUMmer3.23/src/kurtz/libbasedir/libbase.a deleted file mode 100644 index c5d42f1..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/libbasedir/libbase.a and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/libbasedir/mapfile.o b/tools/MUMmer3.23/src/kurtz/libbasedir/mapfile.o deleted file mode 100644 index d59f044..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/libbasedir/mapfile.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/libbasedir/multiseq.o b/tools/MUMmer3.23/src/kurtz/libbasedir/multiseq.o deleted file mode 100644 index 762cc9c..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/libbasedir/multiseq.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/libbasedir/procopt.o b/tools/MUMmer3.23/src/kurtz/libbasedir/procopt.o deleted file mode 100644 index d9c5bb3..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/libbasedir/procopt.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/libbasedir/safescpy.o b/tools/MUMmer3.23/src/kurtz/libbasedir/safescpy.o deleted file mode 100644 index 914233a..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/libbasedir/safescpy.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/libbasedir/seterror.o b/tools/MUMmer3.23/src/kurtz/libbasedir/seterror.o deleted file mode 100644 index 177e6af..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/libbasedir/seterror.o and /dev/null differ diff --git 
a/tools/MUMmer3.23/src/kurtz/libbasedir/space.o b/tools/MUMmer3.23/src/kurtz/libbasedir/space.o deleted file mode 100644 index 8c30f28..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/libbasedir/space.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/mm3src/findmaxmat.o b/tools/MUMmer3.23/src/kurtz/mm3src/findmaxmat.o deleted file mode 100644 index e4bb28e..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/mm3src/findmaxmat.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/mm3src/findmumcand.o b/tools/MUMmer3.23/src/kurtz/mm3src/findmumcand.o deleted file mode 100644 index 58584a0..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/mm3src/findmumcand.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/mm3src/maxmat3.o b/tools/MUMmer3.23/src/kurtz/mm3src/maxmat3.o deleted file mode 100644 index ae40a53..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/mm3src/maxmat3.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/mm3src/maxmatinp.o b/tools/MUMmer3.23/src/kurtz/mm3src/maxmatinp.o deleted file mode 100644 index 9fb944f..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/mm3src/maxmatinp.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/mm3src/maxmatopt.o b/tools/MUMmer3.23/src/kurtz/mm3src/maxmatopt.o deleted file mode 100644 index 5dbd5fe..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/mm3src/maxmatopt.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/mm3src/procmaxmat.o b/tools/MUMmer3.23/src/kurtz/mm3src/procmaxmat.o deleted file mode 100644 index 4afe078..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/mm3src/procmaxmat.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/access.o b/tools/MUMmer3.23/src/kurtz/streesrc/access.o deleted file mode 100644 index b244962..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/access.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/addleafcount.o 
b/tools/MUMmer3.23/src/kurtz/streesrc/addleafcount.o deleted file mode 100644 index 0a5b683..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/addleafcount.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/construct.o b/tools/MUMmer3.23/src/kurtz/streesrc/construct.o deleted file mode 100644 index 9cc1853..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/construct.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/depthtab.o b/tools/MUMmer3.23/src/kurtz/streesrc/depthtab.o deleted file mode 100644 index 24515e7..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/depthtab.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/dfs.o b/tools/MUMmer3.23/src/kurtz/streesrc/dfs.o deleted file mode 100644 index 43d0e78..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/dfs.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/ex2leav.o b/tools/MUMmer3.23/src/kurtz/streesrc/ex2leav.o deleted file mode 100644 index df3fb75..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/ex2leav.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/iterator.o b/tools/MUMmer3.23/src/kurtz/streesrc/iterator.o deleted file mode 100644 index 2035615..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/iterator.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/libstree.a b/tools/MUMmer3.23/src/kurtz/streesrc/libstree.a deleted file mode 100644 index 257e945..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/libstree.a and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/linkloc.o b/tools/MUMmer3.23/src/kurtz/streesrc/linkloc.o deleted file mode 100644 index 03685fa..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/linkloc.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/overmax.o b/tools/MUMmer3.23/src/kurtz/streesrc/overmax.o deleted file mode 100644 index 
6e83182..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/overmax.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/oversucc.o b/tools/MUMmer3.23/src/kurtz/streesrc/oversucc.o deleted file mode 100644 index e07c345..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/oversucc.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/kurtz/streesrc/scanpref.o b/tools/MUMmer3.23/src/kurtz/streesrc/scanpref.o deleted file mode 100644 index 2012203..0000000 Binary files a/tools/MUMmer3.23/src/kurtz/streesrc/scanpref.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/tigr/delta.o b/tools/MUMmer3.23/src/tigr/delta.o deleted file mode 100644 index c2668db..0000000 Binary files a/tools/MUMmer3.23/src/tigr/delta.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/tigr/sw_align.o b/tools/MUMmer3.23/src/tigr/sw_align.o deleted file mode 100644 index 60354ef..0000000 Binary files a/tools/MUMmer3.23/src/tigr/sw_align.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/tigr/tigrinc.o b/tools/MUMmer3.23/src/tigr/tigrinc.o deleted file mode 100644 index 85b90c7..0000000 Binary files a/tools/MUMmer3.23/src/tigr/tigrinc.o and /dev/null differ diff --git a/tools/MUMmer3.23/src/tigr/translate.o b/tools/MUMmer3.23/src/tigr/translate.o deleted file mode 100644 index 8b1ff05..0000000 Binary files a/tools/MUMmer3.23/src/tigr/translate.o and /dev/null differ diff --git a/tools/augustus.2.5.5/src/LICENCE.TXT b/tools/augustus.2.5.5/src/LICENCE.TXT index b3e65c7..6ddf106 100644 --- a/tools/augustus.2.5.5/src/LICENCE.TXT +++ b/tools/augustus.2.5.5/src/LICENCE.TXT @@ -1,72 +1,91 @@ -Artistic License 2.0 (http://www.opensource.org/licenses/artistic-license-2.0.php) - -Preamble - -This license establishes the terms under which a given free software Package may be copied, modified, distributed, and/or redistributed. 
The intent is that the Copyright Holder maintains some artistic control over the development of that Package while still keeping the Package available as open source and free software. - -You are always permitted to make arrangements wholly outside of this license directly with the Copyright Holder of a given Package. If the terms of this license do not permit the full use that you propose to make of the Package, you should contact the Copyright Holder and seek a different licensing arrangement. - -Definitions - -"Copyright Holder" means the individual(s) or organization(s) named in the copyright notice for the entire Package. - -"Contributor" means any party that has contributed code or other material to the Package, in accordance with the Copyright Holder's procedures. - -"You" and "your" means any person who would like to copy, distribute, or modify the Package. - -"Package" means the collection of files distributed by the Copyright Holder, and derivatives of that collection and/or of those files. A given Package may consist of either the Standard Version, or a Modified Version. - -"Distribute" means providing a copy of the Package or making it accessible to anyone else, or in the case of a company or organization, to others outside of your company or organization. - -"Distributor Fee" means any fee that you charge for Distributing this Package or providing support for this Package to another party. It does not mean licensing fees. - -"Standard Version" refers to the Package if it has not been modified, or has been modified only in ways explicitly requested by the Copyright Holder. - -"Modified Version" means the Package, if it has been changed, and such changes were not explicitly requested by the Copyright Holder. - -"Original License" means this Artistic License as Distributed with the Standard Version of the Package, in its current version or as it may be modified by The Perl Foundation in the future. 
- -"Source" form means the source code, documentation source, and configuration files for the Package. - -"Compiled" form means the compiled bytecode, object code, binary, or any other form resulting from mechanical transformation or translation of the Source form. -Permission for Use and Modification Without Distribution - -(1) You are permitted to use the Standard Version and create and use Modified Versions for any purpose without restriction, provided that you do not Distribute the Modified Version. -Permissions for Redistribution of the Standard Version - -(2) You may Distribute verbatim copies of the Source form of the Standard Version of this Package in any medium without restriction, either gratis or for a Distributor Fee, provided that you duplicate all of the original copyright notices and associated disclaimers. At your discretion, such verbatim copies may or may not include a Compiled form of the Package. - -(3) You may apply any bug fixes, portability changes, and other modifications made available from the Copyright Holder. The resulting Package will still be considered the Standard Version, and as such will be subject to the Original License. -Distribution of Modified Versions of the Package as Source - -(4) You may Distribute your Modified Version as Source (either gratis or for a Distributor Fee, and with or without a Compiled form of the Modified Version) provided that you clearly document how it differs from the Standard Version, including, but not limited to, documenting any non-standard features, executables, or modules, and provided that you do at least ONE of the following: - -(a) make the Modified Version available to the Copyright Holder of the Standard Version, under the Original License, so that the Copyright Holder may include your modifications in the Standard Version. -(b) ensure that installation of your Modified Version does not prevent the user installing or running the Standard Version. 
In addition, the Modified Version must bear a name that is different from the name of the Standard Version. -(c) allow anyone who receives a copy of the Modified Version to make the Source form of the Modified Version available to others under -(i) the Original License or -(ii) a license that permits the licensee to freely copy, modify and redistribute the Modified Version using the same licensing terms that apply to the copy that the licensee received, and requires that the Source form of the Modified Version, and of any works derived from it, be made freely available in that license fees are prohibited but Distributor Fees are allowed. -Distribution of Compiled Forms of the Standard Version or Modified Versions without the Source - -(5) You may Distribute Compiled forms of the Standard Version without the Source, provided that you include complete instructions on how to get the Source of the Standard Version. Such instructions must be valid at the time of your distribution. If these instructions, at any time while you are carrying out such distribution, become invalid, you must provide new instructions on demand or cease further distribution. If you provide valid instructions or cease distribution within thirty days after you become aware that the instructions are invalid, then you do not forfeit any of your rights under this license. - -(6) You may Distribute a Modified Version in Compiled form without the Source, provided that you comply with Section 4 with respect to the Source of the Modified Version. -Aggregating or Linking the Package - -(7) You may aggregate the Package (either the Standard Version or Modified Version) with other packages and Distribute the resulting aggregation provided that you do not charge a licensing fee for the Package. Distributor Fees are permitted, and licensing fees for other components in the aggregation are permitted. 
The terms of this license apply to the use and Distribution of the Standard or Modified Versions as included in the aggregation. - -(8) You are permitted to link Modified and Standard Versions with other works, to embed the Package in a larger work of your own, or to build stand-alone binary or bytecode versions of applications that include the Package, and Distribute the result without restriction, provided the result does not expose a direct interface to the Package. -Items That are Not Considered Part of a Modified Version - -(9) Works (including, but not limited to, modules and scripts) that merely extend or make use of the Package, do not, by themselves, cause the Package to be a Modified Version. In addition, such works are not considered parts of the Package itself, and are not subject to the terms of this license. -General Provisions - -(10) Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license. - -(11) If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license. - -(12) This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder. - -(13) This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. 
If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed. - -(14) Disclaimer of Warranty: The package is provided by the copyright holder and contributors 'as is' and without any express or implied warranties. The implied warranties of merchantability, fitness for a particular purpose, or non-infringement are disclaimed to the extent permitted by your local law. Unless required by law, no copyright holder or contributor will be liable for any direct, indirect, incidental, or consequential damages arising in any way out of the use of the package, even if advised of the possibility of such damage. \ No newline at end of file +Artistic Licence (http://www.opensource.org/licenses/artistic-license.php) + +Preamble + +The intent of this document is to state the conditions under which a Package +may be copied, such that the Copyright Holder maintains some semblance of +artistic control over the development of the package, while giving the users +of the package the right to use and distribute the Package in a more-or-less +customary fashion, plus the right to make reasonable modifications. + +Definitions: +"Package" refers to the collection of files distributed by the Copyright Holder, +and derivatives of that collection of files created through textual modification. +"Standard Version" refers to such a Package if it has not been modified, or has +been modified in accordance with the wishes of the Copyright Holder. +"Copyright Holder" is whoever is named in the copyright or copyrights for the package. +"You" is you, if you're thinking about copying or distributing this Package. +"Reasonable copying fee" is whatever you can justify on the basis of media cost, +duplication charges, time of people involved, and so on. 
(You will not be required +to justify it to the Copyright Holder, but only to the computing community at large +as a market that must bear the fee.) +"Freely Available" means that no fee is charged for the item itself, though there +may be fees involved in handling the item. It also means that recipients of the +item may redistribute it under the same conditions they received it. + +1. You may make and give away verbatim copies of the source form of the Standard +Version of this Package without restriction, provided that you duplicate all of +the original copyright notices and associated disclaimers. + +2. You may apply bug fixes, portability fixes and other modifications derived +from the Public Domain or from the Copyright Holder. A Package modified in such +a way shall still be considered the Standard Version. + +3. You may otherwise modify your copy of this Package in any way, provided that +you insert a prominent notice in each changed file stating how and when you changed +that file, and provided that you do at least ONE of the following: + +a) place your modifications in the Public Domain or otherwise make them Freely +Available, such as by posting said modifications to Usenet or an equivalent medium, +or placing the modifications on a major archive site such as ftp.uu.net, or by +allowing the Copyright Holder to include your modifications in the Standard Version +of the Package. + +b) use the modified Package only within your corporation or organization. + +c) rename any non-standard executables so the names do not conflict with standard +executables, which must also be provided, and provide a separate manual page for +each non-standard executable that clearly documents how it differs from the Standard +Version. + +d) make other distribution arrangements with the Copyright Holder. + +4. 
You may distribute the programs of this Package in object code or executable +form, provided that you do at least ONE of the following: + +a) distribute a Standard Version of the executables and library files, together +with instructions (in the manual page or equivalent) on where to get the Standard Version. + +b) accompany the distribution with the machine-readable source of the Package with +your modifications. + +c) accompany any non-standard executables with their corresponding Standard Version +executables, giving the non-standard executables non-standard names, and clearly +documenting the differences in manual pages (or equivalent), together with instructions +on where to get the Standard Version. + +d) make other distribution arrangements with the Copyright Holder. + +5. You may charge a reasonable copying fee for any distribution of this Package. +You may charge any fee you choose for support of this Package. You may not charge +a fee for this Package itself. However, you may distribute this Package in aggregate +with other (possibly commercial) programs as part of a larger (possibly commercial) +software distribution provided that you do not advertise this Package as a product +of your own. + +6. The scripts and library files supplied as input to or produced as output from the +programs of this Package do not automatically fall under the copyright of this Package, +but belong to whomever generated them, and may be sold commercially, and may be +aggregated with this Package. + +7. C or perl subroutines supplied by you and linked into this Package shall not be +considered part of this Package. + +8. The name of the Copyright Holder may not be used to endorse or promote products +derived from this software without specific prior written permission. + +9. THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, +INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS +FOR A PARTICULAR PURPOSE. + +The End