Issues with UHGG Reference Genome and GTF File Construction #1

@Lzhewei

I am generating a gene expression matrix for human gut microbiome samples using the MIC-Bac tool. I downloaded the GFF files from the species_catalogue/ folder of the UHGG database to create the reference genome and GTF files needed to build a STAR index.

However, I encountered issues during downstream analysis: the clustering results are poor, with the UMAP plot showing little separation between genera, and the quantification results are also poor. I suspect there are errors in how I constructed the gut reference genome.

Could you please provide guidance on how to build the UHGG reference genome and GTF files?

I downloaded the GFF files for 4,744 species from the species_catalogue directory. I extracted the sequence information to create FASTA files and converted the GFF files to GTF format using gffread. Then, I combined all FASTA files and all GTF files into separate large files. Using these combined files, I built the STAR index with the following parameters:

STAR --runMode genomeGenerate \
	--runThreadN 24 \
	--genomeDir $index_dir \
	--genomeFastaFiles $fasta_file \
	--sjdbGTFfile $gtf_file \
	--sjdbGTFfeatureExon CDS \
	--limitGenomeGenerateRAM 107374182400 \
	--sjdbOverhang 149 \
	--genomeSAindexNbases 10
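
For reference, the sequence-extraction step described above could be sketched roughly as follows. This is a minimal illustration, not the exact commands that were run. It assumes each UHGG GFF file is Prokka-style, with annotation lines followed by a `##FASTA` marker and the contig sequences. The function name `split_gff` and the output naming are hypothetical and are not part of MIC-Bac, STAR, or gffread.

```shell
#!/usr/bin/env bash

# Hypothetical helper: split one GFF file that embeds its sequences after a
# "##FASTA" marker into an annotation-only GFF and a FASTA file.
#   $1 = input GFF, $2 = output prefix (writes <prefix>.gff and <prefix>.fa)
split_gff() {
    local gff="$1" out_prefix="$2"
    awk -v gff_out="${out_prefix}.gff" -v fa_out="${out_prefix}.fa" '
        /^##FASTA/ { in_fasta = 1; next }    # drop the marker line itself
        in_fasta   { print > fa_out; next }  # lines after the marker are sequence
                   { print > gff_out }       # lines before the marker are annotation
    ' "$gff"
}
```

Each resulting `.gff` file would then be converted with `gffread -T`, and the per-genome FASTA and GTF files concatenated with `cat` into the two combined files passed to STAR.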
