45 commits
1971219
Update HUPANassem.pm
zhqduan Aug 9, 2019
0759187
Update HUPANassemLSF.pm
zhqduan Aug 9, 2019
c299bc5
Update HUPANassemSLURM.pm
zhqduan Aug 9, 2019
a49eae8
Change
Sep 28, 2019
3cec12a
Update Makefile
zhqduan Oct 10, 2019
f6da1d1
Update Makefile
zhqduan Oct 10, 2019
522b75a
Update Makefile
zhqduan Oct 10, 2019
0ced593
Update README.md
zhqduan Oct 10, 2019
f424acf
Update Makefile
zhqduan Oct 10, 2019
9385b2c
Update README.md
zhqduan Nov 5, 2019
74c7112
Update HUPANassem.pm
zhqduan Nov 7, 2019
4695ffa
Update getUnalnCtg.pl
zhqduan Nov 14, 2019
c2a4aaa
Update README.md
zhqduan Nov 14, 2019
f688d80
Update README.md
zhqduan Nov 14, 2019
c3fbdd1
Update README.md
zhqduan Nov 14, 2019
d3ff770
Update README.md
zhqduan Nov 14, 2019
a72e724
Update README.md
zhqduan Nov 14, 2019
fb225b6
Update README.md
zhqduan Nov 14, 2019
a6ecb53
Update HUPANgeneExist.pm
zhqduan Nov 14, 2019
e21ac4f
Update HUPANgeneExistLSF.pm
zhqduan Nov 14, 2019
82d3bec
Update getTaxClass.pl
zhqduan Nov 22, 2019
6958bfd
Update HUPANassem.pm
zhqduan Jan 11, 2020
8ec1d07
Update HUPANassemLSF.pm
zhqduan Jan 11, 2020
a4e223a
Update HUPANassemSLURM.pm
zhqduan Jan 11, 2020
f4ceac1
Update HUPANrmContaminate.pm
zhqduan Apr 9, 2020
bea4056
Update HUPANrmContaminateLSF.pm
zhqduan Apr 9, 2020
1fa3239
Update HUPANrmContaminateSLURM.pm
zhqduan Apr 9, 2020
5b40826
Update HUPANsplitSeq.pm
zhqduan Apr 9, 2020
0d2abc5
Update HUPANsplitSeqLSF.pm
zhqduan Apr 9, 2020
e73beed
Update LICENCE.TXT
zhqduan Apr 9, 2020
ee25981
Fix some known bugs
zhqduan Apr 9, 2020
dac7f7e
Update README.md
zhqduan Apr 9, 2020
9a1face
Update README.md
zhqduan Apr 9, 2020
05beb05
Update hupan_cmd.sh
zhqduan Apr 9, 2020
5c81e6b
Update hupan_cmd.sh
zhqduan Apr 9, 2020
c1a0d78
Update .DS_Store
zhqduan Apr 10, 2020
7ab5aac
update
zhqduan Apr 10, 2020
787bccf
Update README.md
zhqduan Apr 15, 2020
70b2685
Update README.md
zhqduan Apr 16, 2020
ff6cdfe
Update README.md
zhqduan Apr 17, 2020
67bfd04
update
zhqduan Apr 25, 2020
6873030
Update HUPANqualSta.pm
zhqduan Apr 26, 2020
b9d3fa7
update
zhqduan Apr 27, 2020
85fc61f
update
zhqduan Apr 27, 2020
bafdfbc
Update HUPANqualSta.pm
zhqduan Apr 30, 2020
Binary file added .DS_Store
Binary file not shown.
64 changes: 40 additions & 24 deletions README.md
@@ -1,10 +1,14 @@
#HUPAN
# HUPAN: HUman Pan-genome ANalysis

---

**1. Introduction**

The human reference genome is still incomplete, especially for those population-specific or individual-specific regions, which may have important functions. It encourages us to build the pan-genome of human population. Previously, our team developed a "map-to-pan" strategy, [EUPAN][1], specific for eukaryotic pan-genome analysis. However, due to the large genome size of individual human genome, [EUPAN][2] is not suit for pan-genome analysis involving in hundreds of individual genomes. Here, we present an improved tool, HUPAN (Human Pan-genome Analysis), for human pan-genome analysis.
The human reference genome is still incomplete, especially for those population-specific or individual-specific regions, which may have important functions. It encourages us to build the pan-genome of human population. Previously, our team developed a "map-to-pan" strategy, [EUPAN][1], specific for eukaryotic pan-genome analysis. However, due to the large genome size of individual human genome, [EUPAN][2] is not suit for pan-genome analysis involving in hundreds of individual genomes. Here, we present an improved tool, HUPAN (HUman Pan-genome ANalysis), for human pan-genome analysis.

The HUPAN homepage is http://cgm.sjtu.edu.cn/hupan/

The HUPAN paper is available at https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1751-y

**2. Installation**

@@ -15,9 +19,11 @@ The human reference genome is still incomplete, especially for those population-
R is utilized for visualization and statistical tests in HUPAN
toolbox. Please install R first and make sure R and Rscript are
under your PATH.
- R packages Several R packages are needed including ggplot2, reshape2
and ape packages. Follow the Installation step,
- or you can install the packages by yourself.

- R packages

Several R packages are needed including ggplot2, reshape2
and ape packages. Follow the Installation step, or you can install the packages by yourself.

**Installation procedures**

@@ -26,9 +32,7 @@ The human reference genome is still incomplete, especially for those population-
`git clone git@github.com:SJTU-CGM/HUPAN.git`

- Alternatively, you also could obtain the toolbox in the [HUPAN][4]
website;

- Please uncompress the HUPAN toolbox package:
website and uncompress the HUPAN toolbox package:

`tar zxvf HUPAN-v**.tar.gz`

@@ -133,7 +137,8 @@ ii. If the reads are not so good, the users could trim or filter low-quality rea
hupanSLURM trim -w 100 -m 100 data/ filter/ /path/to/Trimmomatic

Results could be found in the trim or filter directory.
iii.After trimming or filtration of reads, the sequencing quality should be evaluated again by `qualitySta`, and if the trimming results are still not good for subsequent analyses, new parameters should be given and the above steps should be conducted for several times.

iii. After trimming or filtration of reads, the sequencing quality should be evaluated again by `qualitySta`, and if the trimming results are still not good for subsequent analyses, new parameters should be given and the above steps should be conducted for several times.

**(3) *De novo* assembly of individual genomes**

@@ -147,7 +152,7 @@ Please note that this startegy requires huge memory for assembly an individual h

ii.Assembly by the iterative use of SOAPDenovo2. Not Recommend.

hupanSLURM linearK data assembly_linearK/ /path/to/SOAPDenovo2
hupanSLURM assemble linearK data assembly_linearK/ /path/to/SOAPDenovo2

iii. Assembly by [SGA][11].

@@ -176,10 +181,15 @@ iv. Two types of non-reference sequences, fully unaligned sequences and partiall
v. Non-reference sequences from multiple individuals are merged:

hupanSLURM mergeUnalnCtg Unalign_result/data/ mergeUnalnCtg_result

Alternatively, if you conducted step iv by `hupan`, you can find the merged result in the Unalign_result/total:

mv Unalign_result/total/ mergeUnalnCtg_result


**(5) Remove redundancy and potential contamination sequences**

**(5) Remove redundancy and potential commination sequences**

After obtaining the non-reference sequences from multiple individuals, redundant sequences between different individuals should be excluded, and the potential commination sequences from non-human species are also removed for further analysis.
After obtaining the non-reference sequences from multiple individuals, redundant sequences between different individuals should be excluded, and the potential contamination sequences from non-human species are also removed for further analysis.

i. The step of remove redundancy sequences is conducted by [CDHIT][14] for fully unaligned sequences and partially unaligned sequences, respectively:

@@ -188,17 +198,24 @@ i. The step of remove redundancy sequences is conducted by [CDHIT][14] for fully

ii. Then the non-redundant sequences are aligned to NCBI’s non-redundant nucleotide database by [BLAST][15]:

hupanSLURM blastAlign blast rmRedundant rmRedundant_blast /path/to/nt /path/to/blast
mkdir nt & cd nt
wget https://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz |gunzip & cd ..
hupanSLURM blastAlign mkblastdb nt nt_index path/to/blast
mkdir rmRedundant & mv rmRedundant.fully.unaligned rmRedundant & mv rmRedundant.partially.unaligned rmRedundant
hupanSLURM blastAlign blast rmRedundant rmRedundant_blast /path/to/nt_index /path/to/blast
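
Review note: in a POSIX shell, `a & b` starts `a` in the background and runs `b` immediately, whereas `a && b` runs `b` only after `a` succeeds. The sketch below chains the same step ii commands with `&&`. It is only an illustration: the `run` dry-run wrapper and the `prepare_blast_inputs` function are hypothetical helpers, `/path/to/blast` and `/path/to/nt_index` are the README's placeholders, and `wget -P` with a separate `gunzip` is one possible way to fetch and unpack `nt.gz`, not necessarily what the authors intended.

```shell
#!/bin/bash
# Dry-run wrapper (hypothetical helper): print each command unless DRY_RUN=0 is exported.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step ii as a single && chain, so a failed command stops the sequence.
prepare_blast_inputs() {
    blast_dir=${1:-/path/to/blast}   # placeholder path, as in the README
    run mkdir -p nt &&
    run wget -P nt https://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz &&
    run gunzip nt/nt.gz &&
    run hupanSLURM blastAlign mkblastdb nt nt_index "$blast_dir" &&
    run mkdir -p rmRedundant &&
    run mv rmRedundant.fully.unaligned rmRedundant.partially.unaligned rmRedundant/ &&
    run hupanSLURM blastAlign blast rmRedundant rmRedundant_blast /path/to/nt_index "$blast_dir"
}

# With the default DRY_RUN=1 this only prints the seven commands.
prepare_blast_inputs
```

Exporting `DRY_RUN=0` before calling `prepare_blast_inputs /real/path/to/blast` would execute the commands instead of printing them.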

iii. According to the alignment result, the taxonomic classification of each sequences (if have) could be obtained:

hupanSLURM getTaxClass rmRedundant_blast/ data/fully/fully.non-redundant.blast info/ TaxClass_fully
hupanSLURM getTaxClass rmRedundant_blast/ data/partially/partially.non-redundant.blast info/ TaxClass_partially
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz & tar -zvxf new_taxdump.tar.gz
mkdir info & mv nucl_gb.accession2taxid info & mv new_taxdump/rankedlineage.dmp info
hupanSLURM getTaxClass rmRedundant_blast/data/rmRedundant.fully.unaligned/non-redundant.blast info/ TaxClass_fully
hupanSLURM getTaxClass rmRedundant_blast/data/rmRedundant.partially.unaligned/non-redundant.blast info/ TaxClass_partially
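
Review note: the added commands move `new_taxdump/rankedlineage.dmp` into `info/`, which presumes the archive unpacks into a `new_taxdump/` directory. A minimal sketch that does not depend on the archive layout is shown below. The function name `stage_lineage` is hypothetical; it unpacks into a scratch directory and moves whichever `rankedlineage.dmp` it finds.

```shell
#!/bin/bash
# Copy rankedlineage.dmp out of a taxdump archive into an info directory.
# Usage: stage_lineage <new_taxdump.tar.gz> <info_dir>
stage_lineage() {
    archive=$1
    info_dir=$2
    tmp=$(mktemp -d) || return 1

    # Unpack into a scratch directory so the archive layout does not matter.
    if ! tar -zxf "$archive" -C "$tmp"; then
        rm -rf "$tmp"
        return 1
    fi

    lineage=$(find "$tmp" -type f -name rankedlineage.dmp | head -n 1)
    if [ -z "$lineage" ]; then
        echo "rankedlineage.dmp not found in $archive" >&2
        rm -rf "$tmp"
        return 1
    fi

    mkdir -p "$info_dir" && mv "$lineage" "$info_dir/"
    status=$?
    rm -rf "$tmp"
    return $status
}
```

After downloading the archive, `stage_lineage new_taxdump.tar.gz info` would leave `info/rankedlineage.dmp` in place for the `getTaxClass` calls.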

iv. And the sequences classifying as microbiology and non-primate eukaryotes are considered as non-human sequences and removed from further consideration:

hupanSLURM rmCtm -i 60 rmRedundant/fully/fully.non-redundant.fa rmRedundant_blast/data/fully/fully.non-redundant.blast TaxClass_fully/data/accession.name rmCtm_fully
hupanSLURM rmCtm -i 60 rmRedundant/partially/partially.non-redundant.fa rmRedundant_blast/data/partially/partially.non-redundant.blast TaxClass_partially/data/accession.name rmCtm_partially
hupanSLURM rmCtm -i 60 rmRedundant/rmRedundant.fully.unaligned/non-redundant.fa rmRedundant_blast/data/rmRedundant.fully.unaligned/non-redundant.blast TaxClass_fully/data/accession.name rmCtm_fully
hupanSLURM rmCtm -i 60 rmRedundant/rmRedundant.partially.unaligned/non-redundant.fa rmRedundant_blast/data/rmRedundant.partially.unaligned/non-redundant.blast TaxClass_partially/data/accession.name rmCtm_partially

**(6) Construction and annotation of pan-genome**

@@ -219,13 +236,12 @@ iii. Then after all procedures are finished, the outcomes are merged:

iv. The new predicted genes may be highly similar to the genes that are located in reference genome, and additional filtering step should be conducted to ensure the novelty of predicted gene:

hupanSLURM filterNovGen GenePre_merge GenePre_filter /path/to/reference/ /path/to/blast /path/to/cdhit /path/to/RepeatMask
hupanSLURM filterNovGene GenePre_merge GenePre_filter /path/to/reference/ /path/to/blast /path/to/cdhit /path/to/RepeatMask

v. The annotation of pan-genome sequences is simply merged to obtain by combine two annotation files:


hupanSLURM pTpG ref/ref.gtf ref/ref-ptpg.gtf
cat ref/ref-ptpg.gtf non-reference.gtf >pan/pan.gtf
hupanSLURM pTpG ref/ref.gff ref/ref-ptpg.gff
cat ref/ref-ptpg.gff non-reference.gtf >pan/pan.gff

**(7) PAV analysis**

@@ -243,7 +259,7 @@ ii. The result of .sam should be converted to .bam and sorted and indexed use [S

iii. Then the gene body coverage and the cds coverage of each gene are calculated:

hupanSLURM geneCov panBam/data geneCov/ pan/pan.gtf
hupanSLURM geneCov panBam/data geneCov/ pan/pan.gff

iv. Finally, the gene presence-absence is determined by the threshold of cds coverage as 95%:

@@ -263,7 +279,7 @@ Any bugs or suggestions, please contact the [authors][20].
[3]: https://github.com/SJTU-CGM/HUPAN
[4]: http://cgm.sjtu.edu.cn/hupan/download.php
[5]: http://cgm.sjtu.edu.cn/eupan/
[6]: ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/HG001.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam
[6]: http://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/HG001.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam
[7]: http://cgm.sjtu.edu.cn/hupan/data/hupanExample.tar.gz
[8]: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
[9]: http://www.usadellab.org/cms/index.php?page=trimmomatic
7 changes: 3 additions & 4 deletions hupan_cmd.sh
@@ -1,10 +1,9 @@
#!/bin/bash

IFS=' '
complete -W "qualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta getUnalnCtg rmRedundant pTpG geneCov geneExist subSample gFamExist bam2bed fastaSta sim rmCtm blastAlign simSeq splitSeq getTaxClass genePre mergeNovGene" eupan
complete -W "qualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta getUnalnCtg rmRedundant pTpG geneCov geneExist subSample gFamExist bam2bed fastaSta sim getTaxClass rmCtm blastAlign simSeq splitSeq genePre mergeNovGene filterNovGene" hupan

complete -W "qualSta mergeQualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta mergeAssemSta getUnalnCtg mergeUnalnCtg rmRedundant pTpG geneCov mergeGeneCov geneExist subSample gFamExist bam2bed fastaSta sim rmCtm blastAlign simSeq splitSeq getTaxClass genePre mergeNovGene" eupanLSF

complete -W "qualSta mergeQualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta mergeAssemSta getUnalnCtg mergeUnalnCtg rmRedundant pTpG geneCov mergeGeneCov geneExist subSample gFamExist bam2bed fastaSta sim rmCtm blastAlign simSeq splitSeq getTaxClass genePre mergeNovGene" eupanSLURM
complete -W "qualSta mergeQualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta mergeAssemSta getUnalnCtg mergeUnalnCtg rmRedundant pTpG geneCov mergeGeneCov geneExist subSample gFamExist bam2bed fastaSta sim getTaxClass rmCtm blastAlign simSeq splitSeq genePre mergeNovGene filterNovGene" hupanLSF

complete -W "qualSta mergeQualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta mergeAssemSta getUnalnCtg mergeUnalnCtg rmRedundant pTpG geneCov mergeGeneCov geneExist subSample gFamExist bam2bed fastaSta sim getTaxClass rmCtm blastAlign simSeq splitSeq genePre mergeNovGene filterNovGene" hupanSLURM
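
Review note: the three `complete -W` word lists overlap heavily, which is how `filterNovGene` could be missing from some of them before this change. One way to keep them in sync, offered only as a sketch, is to build the LSF/SLURM list from the base list. The variable names `base_cmds`, `cluster_only` and `cluster_cmds` are hypothetical; the words themselves are copied from the new `hupan` and `hupanSLURM` lines above. The resulting order differs from the original, which should not matter to `complete -W`.

```shell
#!/bin/bash
# Subcommands offered by every launcher (copied from the new hupan list).
base_cmds="qualSta trim alignRead sam2bam bamSta assemble alignContig extractSeq assemSta getUnalnCtg rmRedundant pTpG geneCov geneExist subSample gFamExist bam2bed fastaSta sim getTaxClass rmCtm blastAlign simSeq splitSeq genePre mergeNovGene filterNovGene"

# Merge steps that appear only in the LSF and SLURM lists.
cluster_only="mergeQualSta mergeAssemSta mergeUnalnCtg mergeGeneCov"

cluster_cmds="$base_cmds $cluster_only"

# complete is a bash builtin; skip registration in shells that lack it.
if command -v complete >/dev/null 2>&1; then
    complete -W "$base_cmds" hupan
    complete -W "$cluster_cmds" hupanLSF
    complete -W "$cluster_cmds" hupanSLURM
fi
```

With this layout, a new subcommand is added in one place and reaches all three launchers.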
