Authors:
Marjan Hosseini
Devin J. McConnell
Derek Aguiar
The code for the BSEEJ method is provided in bseej.py.
A compressed folder of junction files for one example gene (A2ML1) is included in this repository.
To run BSEEJ with default configurations, first install dependencies (e.g., from requirements.txt or environment.yml), then run bseej.py. A more detailed guide follows.
- Clone the repository:
git clone https://github.com/bayesomicslab/BSEEJ.git
cd BSEEJ- Install the project dependencies and activate the environment using one of these options:
Option 1 (recommended): Conda
conda env create -f environment.yml
conda activate bseej_envOption 2: venv + pip (uses pinned versions from requirements.txt)
python -m venv .venv
# Linux / macOS
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt- Unzip the example data:
unzip A2ML1.zipQuick start (uses default parameters):
python3 bseej.pyExample with explicit paths/params (documents intent and works on fresh setups):
python3 bseej.py -g A2ML1 -p ./A2ML1 -o ./results -k 1 -i 1000 -e 0.01 -a 1 -r 1 -s 1Note: The junction path must contain
.juncfiles (e.g., the unzippedA2ML1folder).
The output directory will be created if it does not exist.
python3 bseej.py [options]Options (defaults shown by python3 bseej.py --help):
-g, --gene <GENE_NAME>: Gene symbol to train (e.g.,A2ML1; default:A2ML1)-p, --junction_path <DIR>: Directory containing.juncfiles (default:./A2ML1)-o, --out_dir <DIR>: Output directory (created if missing; default:./results)-k, --clusters <INT>: Number of clusters (default:1)-i, --max_iter <INT>: Maximum iterations (default:1000)-e, --eta <FLOAT>: Model parameter η (default:0.01)-a, --alpha <FLOAT>: Model parameter α (default:1.0)-r, --r <FLOAT>: Model parameter r (default:1.0)-s, --s <FLOAT>: Model parameter s (default:1.0)-h, --help: Show help and exit
An example notebook (example.ipynb) demonstrates usage.
For creating the EGA and simulated data BAM files we ran the STAR [1] aligner for fast and accurate alignment.
STAR --runThreadN 20 --genomeDir ../genome_data/genome_index/ --outFileNamePrefix ./person_${i}_ --twopassMode Basic --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --readFilesIn ${1}/person_${i}_1.fa ${1}/person_${i}_2.faIn this example $${1}$ is the directory where the files are located. The input file 'person_*_1.fa' is a collection of genes for the ith sample on forward sequence and the second is the collection of genes on the backward sequence. This allows us to use the twopassMode and we also used the intronMotif in order to obtain spliced alignments (XS). Once the files have been aligned they are then separated out into the individual bam files of just one gene to work on at a time.
Regtools [2] was used for an efficient filtering of the junctions.
regtools junctions extract -s 0 -a 6 -m 50 -M 500000 %s -o %s.junc Here the two %s are the bam file and output name respectively. On all data used in this project, EGA, Geuvadis, and simulations, we used the following flags:
* -s: finds XS/unstranded flags
* -a: minimum anchor length into exon (6 bp)
* -m: minimum intron size (50 bp)
* -M: Maximum intron size (500000 bp)
portcullis prep -t 20 -v --force -o %s_portcullis/1-prep/ GRCh38.primary_assembly.genome.fa %s/%s.bamFirst step of portcullis [3] is prep and it takes in the fasta file from the reference genome used, here is an example from the data simulations. It is important to note that portcullis was run on our simulations and both experimental results. Here %s is there name of the folder to direct output to and the bam file name.
portcullis junc -t 20 -v -o %s_portcullis/2-junc/portcullis_all --intron_gff %s_portcullis/1-prep/ The next step that is being done is extraction of the junctions into a gff format. Here %s is just giving it the name of the folder to look into.
portcullis filt -t 20 -v -n --max_length 500000 --min_cov 30 -o %s_portcullis/3-filt/portcullis_filtered --intron_gff %s_portcullis/1-prep/ %s_portcullis/2-junc/portcullis_all.junctions.tabFinally, we set some filtering using portcullis. We only keep introns that have a max length of 500000, do not use the machine learning filtering, and have a minimum coverage of 30. Here again %s is just point to the gene folder to look into for the files. After portcullis is complete we do an overlap check between the junctions found from regtools and portcullis and only keep the ones that overlap with at least 90%.
BREM Probabilistic Graphical Model
Main variables and parameters include:
-
$V$ is the set of unique intron excisions, indexed by$v$ and its size of denoted by$|V|$ . -
$N$ is the number of samples and are indexed by$i$ . -
$J_i$ is the number of intron excisions in ith sample. -
$K$ is the number of clusters (indexed by$k$ ). -
For the jth intron excision in the ith sample, we assign a cluster k.
-
Graph
$G = (V, E)$ , where V is the set of unique intron excision and there is an edge between two intron excisions iff they intersect each other. -
$\Omega$ is the set of all the independent sets in$G$ . -
$r, s$ are priors for$\pi$ (Beta distribution):
* Increase in mean of Beta$(r,s)$ results in increase in cluster size |SEEJ|.
- The structure of a (clusters) SEEJs consists of the inclusion or exclusion of intron excisions.
- For cluster k,
$\beta_{k}$ is a$|V|$ -dimensional Dirichlet which represents the distribution of the cluster k over the intron excisions.
*
- For the ith sample, variable
$\theta_i$ is a$K$ -dimensional Dirichlet distribution and represents the proportions of the clusters in sample$i$ .
*
- Variable
$z_{ij}$ is the cluster assignment for jth intron excision in ith sample. It can take a natural value between 1 and$K$ and follows a Multinomial:
- In the ith sample, the jth intron excision is
$w_{ij}$ and is observed and follows a Multinomial distribution:
[1] Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., ... & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21.
[2] Feng, Y. Y., Ramu, A., Cotto, K. C., Skidmore, Z. L., Kunisaki, J., Conrad, D. F., ... & Griffith, M. (2018). RegTools: Integrated analysis of genomic and transcriptomic data for discovery of splicing variants in cancer. BioRxiv, 436634.
[3] Mapleson, D., Venturini, L., Kaithakottil, G., & Swarbreck, D. (2018). Efficient and accurate detection of splice junctions from RNA-seq with Portcullis. GigaScience, 7(12), giy131.
