A critical QC step in droplet-based single-cell (and -nucleus) transcriptomic profiling is the exclusion of cell “doublets”, in which two cells are encapsulated into the same droplet. In silico doublet detection algorithms typically operate by comparing the observed cells to synthetic doublets formed by summing or averaging the gene expression profiles from two putative singlets within the dataset. However, this common method of synthetic doublet creation is conceptually flawed in systems where absolute RNA levels are expected to vary dramatically between cell types.
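The conventional synthetic doublet step described above can be sketched roughly as follows. This is a minimal illustration, not the code of any particular doublet detection package: the function name `make_synthetic_doublets`, its parameters, and the assumption that `counts` is a dense cells-by-genes NumPy array are all hypothetical.

```python
import numpy as np

def make_synthetic_doublets(counts, n_doublets, mode="sum", seed=0):
    """Combine random pairs of putative singlets into synthetic doublets.

    counts: cells x genes array of expression values (assumed dense,
            with at least two cells).
    mode:   "sum" adds the two profiles; "average" divides the sum by two.
    Returns the synthetic profiles plus the indices of the paired cells.
    """
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    first = rng.integers(0, n_cells, size=n_doublets)
    second = rng.integers(0, n_cells, size=n_doublets)
    # Avoid pairing a cell with itself by shifting to the next cell.
    same = first == second
    second[same] = (second[same] + 1) % n_cells
    combined = counts[first] + counts[second]
    if mode == "average":
        combined = combined / 2.0
    return combined, first, second

# Toy example: 4 cells x 3 genes.
toy_counts = np.arange(12).reshape(4, 3)
toy_doublets, idx_a, idx_b = make_synthetic_doublets(toy_counts, n_doublets=5)
```

Note that both modes treat the two parent cells symmetrically; nothing in this step encodes prior knowledge that one cell type carries more RNA than another.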
I thus modified state-of-the-art doublet detection software so that its synthetic doublet generation process is Absolute RNA Content Aware (ARCA). Following cell clustering and annotation, cells are assigned a “high,” “average,” or “low” ARC label through one of three techniques: (i) naïve manual user specification, or empirical estimation from either (ii) in-dataset housekeeping genes or (iii) deconvolution against external bulk reference data.
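One way ARC labels could enter synthetic doublet generation is as mixing weights on the two parent profiles. The sketch below is an assumption made for illustration and is not taken from the ARCA repository: the numeric weights in `ARC_WEIGHTS`, the function name `arc_weighted_doublet`, and the choice to mix per-cell expression proportions are all hypothetical.

```python
import numpy as np

# Hypothetical relative RNA content for each ARC label (illustrative values).
ARC_WEIGHTS = {"low": 0.5, "average": 1.0, "high": 2.0}

def arc_weighted_doublet(profile_a, profile_b, arc_a, arc_b):
    """Mix two singlet profiles in proportion to their ARC labels.

    Each profile is first converted to per-cell proportions, so the
    mixing ratio is set by the ARC weights rather than by sequencing
    depth. The result is itself a vector of proportions summing to 1.
    """
    prop_a = profile_a / profile_a.sum()
    prop_b = profile_b / profile_b.sum()
    w_a = ARC_WEIGHTS[arc_a]
    w_b = ARC_WEIGHTS[arc_b]
    return (w_a * prop_a + w_b * prop_b) / (w_a + w_b)

# Two toy cells expressing disjoint genes.
cell_a = np.array([10.0, 0.0])
cell_b = np.array([0.0, 10.0])
mixed = arc_weighted_doublet(cell_a, cell_b, "high", "low")
```

Under these illustrative weights, a “high” cell paired with a “low” cell contributes 80% of the synthetic profile (2.0 / 2.5), whereas plain averaging would assign each cell 50%. When both cells share the same label, the mix reduces to an equal-weight average.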
This initial proof of concept suggests that incorporating information about Absolute RNA Content may improve doublet detection in real and simulated datasets, while having limited negative consequences even in the absence of ARC imbalance.
Method development, evaluation on real and simulated datasets, and manuscript writing were performed in one month per the UCLA Bioinformatics PhD Written Qualifying Exam in 2024.
github.com/chooliu/ARCA_scRNAseq_Doublet_Detection