This repository contains the code for our paper: The role of alternative splicing in driving yet another phase transition in genomic complexity [doi]
step1_get_data.py
- Download the genome annotations and protein sequences for all species on Ensembl
- Calculate descriptive statistics for gene length, exon count, and protein length for each species
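
As a rough illustration of the statistics step, the sketch below computes gene length and exon count summaries from GFF3 lines. It is not the code in step1_get_data.py: the function names, the choice of counting exons on the transcript with the most exons, and the toy records are all assumptions made for this example.

```python
import statistics
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute column ('ID=g1;Parent=x') into a dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def summarize_genes(gff3_lines):
    """Summarize gene length and exon count for one species (hypothetical helper)."""
    gene_length = {}
    transcript_to_gene = {}
    exons_per_transcript = defaultdict(int)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        feature, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if feature == "gene":
            gene_length[attrs["ID"]] = end - start + 1  # GFF3 coordinates are 1-based, inclusive
        elif feature == "mRNA":
            transcript_to_gene[attrs["ID"]] = attrs["Parent"]
        elif feature == "exon":
            for parent in attrs["Parent"].split(","):
                exons_per_transcript[parent] += 1

    # Assumption: a gene's exon count is that of its transcript with the most exons.
    exon_count = defaultdict(int)
    for transcript, n_exons in exons_per_transcript.items():
        gene = transcript_to_gene.get(transcript)
        if gene is not None:
            exon_count[gene] = max(exon_count[gene], n_exons)

    counts = list(exon_count.values())
    return {
        "gene_length_mean": statistics.mean(gene_length.values()),
        "exon_count_mean": statistics.mean(counts),
        "exon_count_variance": statistics.variance(counts),  # sample variance
    }


def row(feature, start, end, attrs):
    """Build one toy GFF3 line (made-up data, for illustration only)."""
    return "\t".join(["1", "toy", feature, str(start), str(end), ".", "+", ".", attrs])


SAMPLE = [
    "##gff-version 3",
    row("gene", 100, 1099, "ID=g1"),
    row("mRNA", 100, 1099, "ID=t1;Parent=g1"),
    row("exon", 100, 199, "Parent=t1"),
    row("exon", 300, 399, "Parent=t1"),
    row("exon", 1000, 1099, "Parent=t1"),
    row("mRNA", 100, 1099, "ID=t2;Parent=g1"),
    row("exon", 100, 199, "Parent=t2"),
    row("exon", 1000, 1099, "Parent=t2"),
    row("gene", 200, 499, "ID=g2"),
    row("mRNA", 200, 499, "ID=t3;Parent=g2"),
    row("exon", 200, 499, "Parent=t3"),
]

stats = summarize_genes(SAMPLE)
```

Real Ensembl GFF3 files prefix identifiers (for example `gene:` and `transcript:`); the sketch treats IDs as opaque strings, so that should not matter, but this has not been checked against every release.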
step2_merge_all_data.py
- Helper script that aggregates the partial result files if step1 was interrupted.
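
The aggregation idea can be sketched as follows, using only the standard library. The file names, the `species` key column, and the rule that a later file overrides an earlier one are hypothetical; the actual layout written by step1 may differ.

```python
import csv
import tempfile
from pathlib import Path


def merge_results(paths, out_path, key="species"):
    """Merge partial TSV result files, keeping one row per species.

    Assumption: every file shares the same header and has a `key` column.
    When a species appears in several files, the row from the last file wins.
    """
    merged = {}
    fieldnames = None
    for path in sorted(paths):
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            if fieldnames is None:
                fieldnames = reader.fieldnames
            for record in reader:
                merged[record[key]] = record
    if fieldnames is None:
        raise ValueError("no input files to merge")
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(merged.values())
    return len(merged)


# Toy demonstration with made-up values in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "part1.tsv").write_text("species\tmean_gene_length\nspecies_a\t100\n")
    (tmp / "part2.tsv").write_text(
        "species\tmean_gene_length\nspecies_a\t120\nspecies_b\t90\n"
    )
    n_species = merge_results(tmp.glob("part*.tsv"), tmp / "merged.tsv")
    merged_rows = (tmp / "merged.tsv").read_text().splitlines()
```

Keying on the species name makes the merge idempotent, so rerunning it after a resumed download does not duplicate rows.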
step3_avg_variance_exon.py
- Script for Fig.1A
step4_mean_plots_gene_length.py
- Script for Fig.1B
step5_mean_plots_protein_length.py
- Script for Fig.1B inset
Protein-coding gene annotations in GFF3 format are obtained from the Ensembl and EnsemblGenomes FTP servers. Archaea and Bacteria are excluded from this study because they generally lack alternative splicing.
Protein sequences in FASTA format are obtained from the Ensembl and EnsemblGenomes FTP servers.
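
For readers unfamiliar with the FASTA input, protein lengths can be derived from it roughly as in the sketch below. The function name and the toy records are illustrative, and stripping a trailing `*` stop symbol is an assumption rather than something the repository is known to do.

```python
def protein_lengths(fasta_lines):
    """Map each FASTA record ID to its sequence length (hypothetical helper)."""
    lengths = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            # Assumption: a trailing '*' marks a stop codon and is not a residue.
            lengths[current] += len(line.rstrip("*"))
    return lengths


# Toy records with made-up sequences; sequences may span several lines.
SAMPLE_FASTA = [
    ">P1 pep toy description",
    "MKT",
    "AAL",
    ">P2",
    "MSEQ*",
]

lengths = protein_lengths(SAMPLE_FASTA)
```

The function accepts any iterable of lines, so an open file handle can be passed in directly without loading the whole file into memory.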
| Database | Release | Genome Annotation | Protein Sequence |
|---|---|---|---|
| Ensembl | 114 | Link | Link |
| EnsemblMetazoa | 61 | Link | Link |
| EnsemblPlants | 61 | Link | Link |
| EnsemblFungi | 61 | Link | Link |
| EnsemblProtists | 61 | Link | Link |
After data acquisition and aggregation, panel_a_data.tsv is generated and serves as the primary dataset for statistical analysis and visualization.