The SPONGE package generates human prior gene regulatory networks and protein-protein interaction networks for the involved transcription factors.
This repository contains the SPONGE package, which allows the generation of human prior gene regulatory networks based mainly on the data from the JASPAR database. It also uses NCBI to find the human analogs of vertebrate transcription factors, UniProt for symbol matching, and STRING to retrieve protein-protein interactions between transcription factors. By default, Ensembl is used to collect all the promoter regions in the human genome as the regions of interest, but different regions can be provided by the user. Because SPONGE accesses these databases on the fly, it requires internet access.
Prior gene regulatory networks are useful mainly as an input for tools that incorporate additional sources of information to refine them. The prior networks generated by SPONGE are designed to be compatible with PANDA and related NetZoo tools.
This project is intended for people who do not have the knowledge or inclination to run a genome-wide motif search themselves, but who would still like to change some of the parameters used to generate publicly available prior gene regulatory networks. It is also designed to make it easier to incorporate new information from database updates into the prior networks.
If you just want to use the prior networks generated by the stable version of SPONGE with the default settings, they are available on Zenodo.
This repository only contains the SPONGE package. The code used to create figures in the SPONGE manuscript can be found here.
The features already available are:
- Generation of prior gene regulatory network
- Generation of prior protein-protein interaction network for transcription factors
- Automatic download of required files during setup
- Parallelised motif filtering
- Command line interface
The requirements are provided in a requirements.txt file.
SPONGE can be installed via pip:
pip install netzoopy-sponge
Alternatively, it can be installed by downloading this repository and then installing with pip (possibly in editable mode):
git clone https://github.com/ladislav-hovan/sponge.git
cd sponge
pip install -e .
SPONGE comes with a netzoopy-sponge command line script:
# Get information about the available options
netzoopy-sponge --help
# Run the pipeline
netzoopy-sponge
SPONGE has a lot of options, which can be seen by generating an example config file:
# Create an example config file in the current directory
netzoopy-sponge -e
The defaults are designed to be sensible, so users do not have to change any of them unless they wish to.
Within Python, the default workflow can be invoked as follows:
# Import the class definition
from sponge.sponge import Sponge
# Run the default workflow
# Will create a temporary folder in the current directory
sponge_obj = Sponge()
Much like the command line script, the Sponge class accepts a lot of
options for the configuration, which can be specified through a path
to a config file or a dictionary with the options.
For more information, you can run help(Sponge) after the import.
In case one needs more control over the individual steps, the workflow in Python would be as follows:
# Import the class definition
from sponge.sponge import Sponge
# Create the SPONGE object
# The default workflow option can also just be specified in the config
sponge_obj = Sponge(
    config=path_to_config_file,
    config_update={'default_workflow': False},
)
# Select the appropriate transcription factors from JASPAR
sponge_obj.select_motifs()
# Filter the TF binding sites of the JASPAR bigbed file to the ones
# in the defined regions of interest
sponge_obj.filter_tfbs()
# Retrieve the protein-protein interactions between the transcription
# factors from the STRING database
sponge_obj.retrieve_ppi()
# Write the motif and PPI priors to their respective files
sponge_obj.write_output_files()
At each step, there is an option to tweak the settings provided in the
initial configuration, either through keyword arguments or using the
user_config_update argument.
We would urge caution when using this setting though, as this can make
the settings inconsistent between different steps.
The final set of settings used will be saved in the temporary directory
after the SPONGE object is deleted.
SPONGE will attempt to download the files it needs into a temporary
directory (.sponge_temp by default).
Paths can be provided if these files were downloaded in advance.
The JASPAR bigbed file required for filtering is huge (> 100 GB), so
the download might take some time.
Make sure you're running SPONGE somewhere that has enough space!
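Before starting a large download, it can help to check the free space on the target filesystem. The following is a minimal sketch using only the Python standard library. It is not part of SPONGE: the helper name `has_enough_space` and the 100 GiB threshold (taken from the "> 100 GB" figure above) are assumptions for illustration.

```python
import shutil

# Assumption: treat 100 GiB as the minimum needed for the bigbed file
REQUIRED_BYTES = 100 * 1024**3


def has_enough_space(directory=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `directory` has enough free space."""
    # disk_usage needs an existing path and reports total, used and free bytes
    usage = shutil.disk_usage(directory)
    return usage.free >= required_bytes


if not has_enough_space("."):
    print("Warning: less than 100 GiB free in the current directory")
```

The check should be pointed at the directory that will actually hold the temporary folder, since that may sit on a different filesystem from the current working directory.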
As an alternative to the bigbed file download, SPONGE can download
tracks for individual TFs on the fly and filter them individually.
This way of processing is slower than the bigbed file when all TFs in
the database are considered, but it becomes competitive when only
a subset is used.
The physical storage footprint is much reduced.
This option is enabled with on_the_fly_processing: True in the
configuration file.
For filtering, n_processes defaults to 1, but we highly recommend
increasing it if your machine is capable of it.
During our testing, the entire default workflow could be done in just
over 10 minutes with 16 processes (this excludes the time taken to
download the required files).
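One way to choose a value for n_processes is to look at how many CPUs the current process may actually use, which on shared HPC nodes can be fewer than the machine has. The sketch below is only a suggestion and not part of SPONGE: the helper names and the cap of 16 (borrowed from the test run mentioned above) are assumptions.

```python
import os


def usable_cpus():
    """Number of CPUs available to this process (at least 1)."""
    # sched_getaffinity respects CPU pinning but only exists on some platforms
    if hasattr(os, "sched_getaffinity"):
        return max(1, len(os.sched_getaffinity(0)))
    return os.cpu_count() or 1


def suggest_n_processes(cap=16):
    """Pick a process count between 1 and `cap`, limited by usable CPUs."""
    return max(1, min(cap, usable_cpus()))


print(f"Suggested n_processes: {suggest_n_processes()}")
```

The resulting number can then be entered as n_processes in the configuration. The example config file shows where that setting lives.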
Users are free to provide their own files for the list of regions of
interest and their mapping to transcripts and genes
(region: region_file, default regions.tsv) and the list of predicted
TF binding sites (motif: tfbs_file, default tfbs.bb).
By default, if the paths are not provided or set to None, SPONGE
attempts to locate these files in the temporary folder under the default
names.
If it fails to do so, it will proceed to download them.
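The lookup order described above can be sketched as follows. This is an illustration of the documented behaviour, not SPONGE's actual code: the function `resolve_input_file` is hypothetical, and returning `None` stands in for "the file has to be downloaded".

```python
import tempfile
from pathlib import Path


def resolve_input_file(path, temp_dir=".sponge_temp", default_name="regions.tsv"):
    """Return the file to use, or None if it would have to be downloaded."""
    # 1. An explicitly provided path takes priority
    if path is not None:
        return Path(path)
    # 2. Otherwise look for the default name in the temporary folder
    candidate = Path(temp_dir) / default_name
    if candidate.is_file():
        return candidate
    # 3. Nothing found: the caller would proceed to download
    return None


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    missing = resolve_input_file(None, tmp, "regions.tsv")
    (Path(tmp) / "regions.tsv").write_text("placeholder\n")
    found = resolve_input_file(None, tmp, "regions.tsv")

print(missing)      # None, because the file did not exist yet
print(found.name)   # regions.tsv
```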
The list of regions of interest is expected as a seven-column TSV file with a defined header (columns are tab-separated in the actual file and are aligned here for readability), for example:
Chromosome  Start  End    Transcript stable ID  Gene stable ID   Gene name    Gene type
chr1        10676  11676  ENST00000832828       ENSG00000290825  DDX11L16     lncRNA
chr1        11260  12260  ENST00000450305       ENSG00000223972  DDX11L1      transcribed_unprocessed_pseudogene
chr1        17186  18186  ENST00000619216       ENSG00000278267  MIR6859-1    miRNA
chr1        24636  25636  ENST00000488147       ENSG00000227232  WASH7P       transcribed_unprocessed_pseudogene
chr1        27839  28839  ENST00000834619       ENSG00000243485  MIR1302-2HG  lncRNA
chr1        28804  29804  ENST00000473358       ENSG00000243485  MIR1302-2HG  lncRNA
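If you prepare your own regions file, a quick sanity check before handing it to SPONGE can catch header or coordinate mistakes early. The sketch below uses only the standard library and is not SPONGE's own validation. The function `read_regions` is hypothetical, and requiring the exact header order shown above is an assumption.

```python
import csv
import io

# Header as shown in the example above (exact order is an assumption)
EXPECTED_HEADER = [
    "Chromosome", "Start", "End", "Transcript stable ID",
    "Gene stable ID", "Gene name", "Gene type",
]


def read_regions(handle):
    """Parse a tab-separated regions file and check its header and coordinates."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"Unexpected header: {reader.fieldnames}")
    regions = []
    for row in reader:
        row["Start"], row["End"] = int(row["Start"]), int(row["End"])
        if row["Start"] >= row["End"]:
            raise ValueError(f"Empty or reversed region: {row}")
        regions.append(row)
    return regions


# Two rows copied from the example, joined with tabs
sample = "\n".join([
    "\t".join(EXPECTED_HEADER),
    "chr1\t10676\t11676\tENST00000832828\tENSG00000290825\tDDX11L16\tlncRNA",
    "chr1\t17186\t18186\tENST00000619216\tENSG00000278267\tMIR6859-1\tmiRNA",
]) + "\n"

regions = read_regions(io.StringIO(sample))
print(len(regions), regions[0]["Gene name"])
```

For a file on disk, the same function can be called with `open("regions.tsv", newline="")` instead of the in-memory sample.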
The predicted TF binding sites are expected in a binary bigbed file, with the following format when decoded:
chrom  start  end    name      score  strand  TFName
chr1   10000  10006  MA0467.3  276    -       Crx
chr1   10000  10006  MA0648.2  233    +       GSC
chr1   10000  10006  MA0682.3  231    +       PITX1
chr1   10000  10006  MA0711.2  198    +       OTX1
chr1   10000  10006  MA0714.2  246    +       PITX3
Effectively, it is an extended bed format with a header, which uses
the name column to provide JASPAR matrix ID and the TFName column
to provide the actual name of the transcription factor.
However, currently SPONGE expects a bigbed file and will not work with
a bed file.
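To make the roles of the columns concrete, decoded rows can be modelled as small records and matched against a region. This is only an illustrative sketch, not how SPONGE filters binding sites: the `BindingSite` type and both helper functions are hypothetical, the region 9500-10500 is made up, and the rule that a site must lie fully inside the region is an assumption.

```python
from typing import NamedTuple


class BindingSite(NamedTuple):
    chrom: str
    start: int
    end: int
    matrix_id: str  # the "name" column: JASPAR matrix ID
    score: int
    strand: str
    tf_name: str    # the "TFName" column: transcription factor name


def parse_site(line):
    """Turn one decoded, whitespace-separated row into a BindingSite."""
    chrom, start, end, name, score, strand, tf_name = line.split()
    return BindingSite(chrom, int(start), int(end), name, int(score), strand, tf_name)


def sites_in_region(sites, chrom, region_start, region_end):
    """Keep sites lying fully inside the region (assumed containment rule)."""
    return [
        s for s in sites
        if s.chrom == chrom and s.start >= region_start and s.end <= region_end
    ]


decoded = [
    "chr1 10000 10006 MA0467.3 276 - Crx",
    "chr1 10000 10006 MA0648.2 233 + GSC",
]
sites = [parse_site(line) for line in decoded]

# Hypothetical region around the example sites
hits = sites_in_region(sites, "chr1", 9500, 10500)
print([(s.matrix_id, s.tf_name) for s in hits])
```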
SPONGE releases are also provided as Docker containers.
The most basic way of running would involve mounting a directory to the
/app directory on the container, where SPONGE will be run:
docker run --mount type=bind,source="$(pwd)"/sponge_run,target=/app ghcr.io/kuijjerlab/netzoopy_sponge:latest --help
The arguments match those of the netzoopy-sponge command line script.
In particular, it could be useful to generate an example input file
first using the --example option, then editing the configuration file
as appropriate.
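When the container is launched from a script, the same command can be assembled in Python so that the mount path is built the same way on every platform. The sketch below only constructs the argument list and does not run anything. The helper `build_docker_command` is hypothetical, and the image name and /app target are taken from the command above.

```python
from pathlib import Path

IMAGE = "ghcr.io/kuijjerlab/netzoopy_sponge:latest"


def build_docker_command(run_dir, sponge_args):
    """Build a docker run argument list that bind-mounts run_dir to /app."""
    # Bind mounts need an absolute source path
    source = Path(run_dir).resolve()
    return [
        "docker", "run",
        "--mount", f"type=bind,source={source},target=/app",
        IMAGE,
        *sponge_args,
    ]


cmd = build_docker_command("sponge_run", ["--help"])
print(" ".join(cmd))

# Where Docker is available, the command could be executed with:
# import subprocess
# subprocess.run(cmd, check=True)
```

Docker generally expects the source directory of a bind mount to exist already, so create sponge_run before running the command.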
Without mounting a directory, it is impossible to both provide an input
file and retrieve the generated prior networks, unless of course the
container is run interactively:
docker run -it --entrypoint bash ghcr.io/kuijjerlab/netzoopy_sponge:latest
In HPC environments, something like the apptainer shell command would
work.
Because of the libraries used for bigbed format support, SPONGE is not currently supported on Windows. Therefore, this container is probably the best way to run it there, and the command equivalent to the above in the command prompt would look like this:
docker.exe run --mount type=bind,source="%cd%"/sponge_run,target=/app ghcr.io/kuijjerlab/netzoopy_sponge:latest --help
The project is: in progress.
Room for improvement:
- Better tests
- Try incorporating unipressed
- Improve overlap computations
To do:
- Support for more species
Many thanks to the members of the Kuijjer group at NCMBM/UH for their feedback and support.
This README is based on a template made by @flynerdpl.
Created by Ladislav Hovan (ladislav.hovan@ncmbm.uio.no). Feel free to contact me!
This project is open source and available under the GNU General Public License v3.