Description
I am working on generating a gene expression matrix using the MIC-Bac tool for human gut microbiome samples. I downloaded the GFF files from the species_catalogue/ folder of the UHGG database to create the reference genome and GTF files needed to build a STAR index.
However, I encountered issues in the downstream analysis: the clustering is poor, with the UMAP plot showing little separation between genera, and the quantification results are also poor. I suspect there may be errors in how I constructed the gut reference genome.
Could you please provide guidance on how to build the UHGG reference genome and GTF files?
Here is what I did. I downloaded the GFF files for 4,744 species from the species_catalogue directory, extracted the sequence information to create FASTA files, and converted the GFF files to GTF format using gffread. I then concatenated all FASTA files into one large file and all GTF files into another. Using these combined files, I built the STAR index with the following parameters:
STAR --runMode genomeGenerate \
    --runThreadN 24 \
    --genomeDir $index_dir \
    --genomeFastaFiles $fasta_file \
    --sjdbGTFfile $gtf_file \
    --sjdbGTFfeatureExon CDS \
    --limitGenomeGenerateRAM 107374182400 \
    --sjdbOverhang 149 \
    --genomeSAindexNbases 10
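
For reference, the extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the exact commands that were run: it assumes each GFF file embeds its sequences after a `##FASTA` line, and the helper names `split_gff` and `check_seqids` are hypothetical. The second helper is a sanity check that every sequence ID used in the annotation also exists in the FASTA, since a mismatch there would leave genes without a matching reference sequence.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes the GFF3 input carries its sequences after a "##FASTA"
# line. split_gff and check_seqids are hypothetical helper names.

# Split one GFF3 file into <prefix>.gff (annotation) and <prefix>.fna (FASTA).
# If the input has no "##FASTA" section, <prefix>.fna is not created.
split_gff() {
    local gff="$1" prefix="$2"
    awk -v ann="${prefix}.gff" -v fa="${prefix}.fna" '
        /^##FASTA/ { in_fasta = 1; next }   # drop the marker line itself
        in_fasta   { print > fa; next }     # sequence records
                   { print > ann }          # header and feature lines
    ' "$gff"
}

# Print every sequence ID (column 1 of the annotation) that has no matching
# FASTA header. Exit status is non-zero if any ID is missing.
check_seqids() {
    local ann="$1" fa="$2"
    awk '
        FILENAME == ARGV[1] {               # first file: the FASTA
            if (/^>/) { seen[substr($1, 2)] = 1 }
            next
        }
        /^#/ || NF == 0 { next }            # skip comments and blank lines
        !($1 in seen) && !reported[$1]++ { print $1; missing = 1 }
        END { exit missing }
    ' "$fa" "$ann"
}
```

In a workflow like the one described, `split_gff` would be called once per species file, each resulting annotation converted with `gffread -T`, and the per-species FASTA and GTF files concatenated before running `check_seqids` on the combined pair.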