Developers' codebook

Silvio Waschina edited this page Dec 5, 2025 · 6 revisions

Important: The following code blocks are not documentation; they are mainly meant for gapseq developers.

Creating an update of the reference sequence database for Zenodo

Update EC numbers in relevant data files

The IUBMB regularly updates EC numbers when reactions are re-classified. It is important to keep the ECs in the relevant gapseq files up to date because reference sequences on UniProt are also updated when ECs are transferred. Thus, if ECs in the gapseq database are not updated, they might not be linked correctly to their reference sequences.

To update the EC numbers in gapseq's data files, run the following:

# Update EC numbers
cd <gapseq_dir>
Rscript src/update_ecs_via_IUBMB.R

After this EC number update, I recommend updating the sequences (see the following section). The publication of the new reference sequence database on Zenodo should ideally be synchronized with pushing the EC updates to the gapseq GitHub repository:

git status # check what files have changes
git add . # or select the ones that you want to commit
git commit -m "Update of EC numbers"
git push

Update Sequences

First, delete all "old" data:

rm dat/seq/Bacteria/rev/*.fasta
rm dat/seq/Bacteria/unrev/*.fasta
rm dat/seq/Bacteria/rxn/*.fasta
rm dat/seq/Archaea/rev/*.fasta
rm dat/seq/Archaea/unrev/*.fasta
rm dat/seq/Archaea/rxn/*.fasta

Second, run gapseq find to re-download everything:

# the genome is irrelevant as no blasting is performed ('-x')
gapseq find -p all -t Bacteria -m all -n -x toy/ecoli.faa.gz > bac_update.log 2>&1
gapseq find -p all -t Archaea -m all -n -x toy/ecoli.faa.gz > ar_update.log 2>&1

Note: This sends thousands of queries to UniProt. Some queries fail (mostly due to timeouts). To fix this, rerun the 'gapseq find' commands from above until no errors are left in the log files.
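The rerun-until-clean step can be scripted. The following is only a sketch: the pattern used to spot failed queries (`ERROR_PATTERN`) and the retry limit (`MAX_TRIES`) are assumptions and not part of gapseq, so check what failures actually look like in your log files before relying on it.

```shell
# Sketch: rerun 'gapseq find' for one taxon until its log is free of failures.
# ASSUMPTION: a failed query leaves a line matching ERROR_PATTERN in the log;
# adapt the pattern to what the failures look like in your log files.
ERROR_PATTERN="${ERROR_PATTERN:-error|timeout}"
MAX_TRIES="${MAX_TRIES:-10}"   # hypothetical upper bound on reruns

# Print the number of lines in log file $1 that match ERROR_PATTERN.
count_log_errors() {
    grep -Eic -e "$ERROR_PATTERN" "$1" || true
}

# Rerun the update for taxon $1 (Bacteria or Archaea), logging to file $2.
update_taxon() {
    try=1
    while [ "$try" -le "$MAX_TRIES" ]; do
        gapseq find -p all -t "$1" -m all -n -x toy/ecoli.faa.gz > "$2" 2>&1
        if [ "$(count_log_errors "$2")" -eq 0 ]; then
            echo "$1: no errors left after $try run(s)"
            return 0
        fi
        try=$((try + 1))
    done
    echo "$1: errors remain after $MAX_TRIES runs, see $2" >&2
    return 1
}

# Usage, from the gapseq directory:
#   update_taxon Bacteria bac_update.log
#   update_taxon Archaea ar_update.log
```

The retry limit keeps the loop from hammering UniProt indefinitely if a query fails for a reason other than a timeout.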

Third, create all sequences.tar.gz archives (rev/unrev/rxn):

cd dat/seq/Bacteria/rev/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
cd dat/seq/Bacteria/unrev/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
cd dat/seq/Bacteria/rxn/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
cd dat/seq/Archaea/rev/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
cd dat/seq/Archaea/unrev/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
cd dat/seq/Archaea/rxn/ && tar -czvf sequences.tar.gz ./*.fasta && cd ../../../../
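The six commands above follow one pattern, so they can also be written as a loop. This is a sketch of an equivalent form; the function name `pack_sequences` is made up for illustration, and it drops tar's `-v` flag to keep the output short.

```shell
# Create sequences.tar.gz in every <taxon>/<kind> directory below $1.
# 'pack_sequences' is a hypothetical helper name, not part of gapseq.
pack_sequences() {
    seq_root="$1"
    for taxon in Bacteria Archaea; do
        for kind in rev unrev rxn; do
            # Subshell: the 'cd' does not change the caller's directory.
            (cd "$seq_root/$taxon/$kind" && tar -czf sequences.tar.gz ./*.fasta) || return 1
        done
    done
}

# Usage, from the gapseq directory:
#   pack_sequences dat/seq
```

Running each `cd` in a subshell avoids the trailing `cd ../../../../` and stops at the first directory where archiving fails.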

Fourth, create an MD5 checksum table for all tar.gz archives:

cd dat/seq/
find . -mindepth 2 -type f -name "*.tar.gz" -exec md5sum {} \; > md5sums.txt
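Before uploading, it may be worth confirming that the checksum table matches the archives on disk. A minimal sketch, assuming GNU coreutils `md5sum` is available; the helper name `verify_md5sums` is made up for illustration.

```shell
# Check every entry of md5sums.txt in directory $1 against the files on disk.
# Returns non-zero if any archive is missing or its checksum differs.
# 'verify_md5sums' is a hypothetical helper name, not part of gapseq.
verify_md5sums() {
    # --quiet (GNU md5sum): only report files that fail verification.
    (cd "$1" && md5sum -c --quiet md5sums.txt)
}

# Usage, from the gapseq directory:
#   verify_md5sums dat/seq && echo "checksums OK"
```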

Fifth, create the taxon-specific final archives for Zenodo upload:

# create taxon-specific final archive for Zenodo upload
tar -czvf Bacteria.tar.gz Bacteria/*/*.tar.gz
tar -czvf Archaea.tar.gz Archaea/*/*.tar.gz
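A quick sanity check on the final archives is to count the inner archives they contain. The sketch below assumes each taxon archive should hold exactly three sequences.tar.gz files (rev, unrev, rxn), as produced by the steps above; `count_inner_archives` is a made-up helper name.

```shell
# Print how many inner sequences.tar.gz files the taxon archive $1 contains.
# 'count_inner_archives' is a hypothetical helper name, not part of gapseq.
count_inner_archives() {
    tar -tzf "$1" | grep -c 'sequences\.tar\.gz$' || true
}

# Usage, from dat/seq/ (three are expected per taxon: rev, unrev, rxn):
#   count_inner_archives Bacteria.tar.gz
#   count_inner_archives Archaea.tar.gz
```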

Sixth, upload the following files to Zenodo via the web interface as a new version of https://doi.org/10.5281/zenodo.10047603:

  • dat/seq/md5sums.txt
  • dat/seq/Bacteria.tar.gz
  • dat/seq/Archaea.tar.gz

Remember to give a new version number!

Updating toy data

From time to time, the toy model files in toy/ should be updated, especially with new gapseq releases.

gapseq doall -b 200 -l 100 -f toy/ -A diamond toy/ecoli.faa.gz
gapseq doall -b 200 -l 100 -f toy/ -A diamond toy/myb71.faa.gz
gzip -f toy/ecoli.xml
gzip -f toy/myb71.xml

cd toy
gapseq test-long
gzip -f ecore.xml
cd ..