Developers' codebook
Important: The following code blocks are not documentation; they are mainly meant for gapseq developers.
The IUBMB regularly updates EC numbers when reactions are re-classified. It is important to keep the ECs in the relevant gapseq files up to date because reference sequences on UniProt are also updated when ECs are transferred. Thus, if ECs in the gapseq database are not updated, they might not be linked correctly to their reference sequences.
To update the EC numbers in gapseq's data files, run the following:
# Update EC numbers
cd <gapseq_dir>
Rscript src/update_ecs_via_IUBMB.R
After this EC number update, I recommend updating the sequences (see following section). The publication of the new reference sequence database on Zenodo should ideally be synchronized with running the following git dance to push the EC updates to the gapseq github repository:
git status # check what files have changes
git add . # or select the ones that you want to commit
git commit -m "Update of EC numbers"
git push
To update the reference sequences, first delete all "old" data:
rm dat/seq/Bacteria/rev/*.fasta
rm dat/seq/Bacteria/unrev/*.fasta
rm dat/seq/Bacteria/rxn/*.fasta
rm dat/seq/Archaea/rev/*.fasta
rm dat/seq/Archaea/unrev/*.fasta
rm dat/seq/Archaea/rxn/*.fasta
Second, run gapseq find to re-download everything:
# the genome is irrelevant as no blasting is performed ('-x')
gapseq find -p all -t Bacteria -m all -n -x toy/ecoli.faa.gz > bac_update.log 2>&1
gapseq find -p all -t Archaea -m all -n -x toy/ecoli.faa.gz > ar_update.log 2>&1
Note: This sends thousands of queries to UniProt. Some queries fail (mostly timeouts). To fix this, rerun the 'gapseq find' commands above until no errors are left in the log files.
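The rerun-until-clean advice in the note above could be automated roughly as follows. This is only a sketch: the helper name `rerun_until_clean` is hypothetical, and the error pattern passed to it is an assumption, since the exact wording of failed queries in the gapseq log is not specified here. Inspect a real log first and adjust the pattern.

```shell
# Rerun a command until its log contains no line matching an error pattern.
# ASSUMPTION: failed UniProt queries leave lines in the log that match the
# pattern you pass in; this is not gapseq's documented log format.
rerun_until_clean() {
  ruc_log=$1; ruc_pattern=$2; ruc_max=$3
  shift 3
  ruc_try=1
  while [ "$ruc_try" -le "$ruc_max" ]; do
    # Overwrite the log on every run; the exit status is ignored on purpose,
    # only the log content decides whether another run is needed.
    "$@" > "$ruc_log" 2>&1 || true
    if ! grep -qiE "$ruc_pattern" "$ruc_log"; then
      echo "clean after $ruc_try run(s)"
      return 0
    fi
    ruc_try=$((ruc_try + 1))
  done
  echo "errors remain after $ruc_max run(s), see $ruc_log" >&2
  return 1
}

# Example call (the pattern 'error|timeout' is a guess):
# rerun_until_clean bac_update.log 'error|timeout' 10 \
#   gapseq find -p all -t Bacteria -m all -n -x toy/ecoli.faa.gz
```

Capping the number of runs avoids hammering UniProt indefinitely if a query fails for a reason other than a timeout.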
Third, create all sequences.tar.gz archives (rev/unrev/rxn):
cd dat/seq/Bacteria/rev/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
cd dat/seq/Bacteria/unrev/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
cd dat/seq/Bacteria/rxn/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
cd dat/seq/Archaea/rev/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
cd dat/seq/Archaea/unrev/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
cd dat/seq/Archaea/rxn/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
Fourth, create an MD5 checksum table for all tar.gz archives:
cd dat/seq/
find -mindepth 2 -type f -name "*.tar.gz" -exec md5sum {} \; > md5sums.txt
Fifth, create the taxon-specific final archives for Zenodo upload:
# create taxon-specific final archive for Zenodo upload
tar -czvf Bacteria.tar.gz Bacteria/*/*.tar.gz
tar -czvf Archaea.tar.gz Archaea/*/*.tar.gz
Sixth, upload the following files to Zenodo via the web interface as a new version of https://doi.org/10.5281/zenodo.10047603:
dat/seq/md5sums.txt
dat/seq/Bacteria.tar.gz
dat/seq/Archaea.tar.gz
Remember to give a new version number!
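Before uploading, it may be worth confirming that md5sums.txt still matches the archives on disk. A minimal sketch, assuming GNU coreutils `md5sum` is available; `verify_md5_table` is a hypothetical helper name and not part of gapseq:

```shell
# Verify every entry of md5sums.txt against the files it lists.
# The table stores paths relative to the directory it was created in
# (dat/seq above), so the check has to run from that same directory.
# ASSUMPTION: GNU md5sum, which supports -c (check) and --quiet.
verify_md5_table() {
  # $1: directory that contains md5sums.txt, e.g. dat/seq
  ( cd "$1" && md5sum -c --quiet md5sums.txt )
}

# Example:
# verify_md5_table dat/seq && echo "all archives match md5sums.txt"
```

With `--quiet`, only mismatching or unreadable files are reported, and the exit status is non-zero if any entry fails.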
From time to time, the toy model files in toy/ should be updated, especially with new gapseq releases.
gapseq doall -b 200 -l 100 -f toy/ -A diamond toy/ecoli.faa.gz
gapseq doall -b 200 -l 100 -f toy/ -A diamond toy/myb71.faa.gz
gzip -f toy/ecoli.xml
gzip -f toy/myb71.xml
cd toy
gapseq test-long
gzip -f ecore.xml
cd ..
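After regenerating the toy files, a quick integrity check of the compressed models can catch a truncated or interrupted run. A small sketch; `check_gz` is a hypothetical helper, and the file names in the example call are inferred from the gzip commands above:

```shell
# Report whether each given .gz file passes gzip's built-in integrity test.
check_gz() {
  cg_status=0
  for cg_file in "$@"; do
    if gzip -t "$cg_file" 2>/dev/null; then
      echo "ok: $cg_file"
    else
      echo "BAD: $cg_file"
      cg_status=1
    fi
  done
  # Non-zero if at least one file failed the test.
  return "$cg_status"
}

# Example (run from the gapseq directory):
# check_gz toy/ecoli.xml.gz toy/myb71.xml.gz toy/ecore.xml.gz
```

This only confirms that the gzip streams are intact; it does not validate the SBML content of the models.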