SCAID - Single Cell Atlas of Immune Diseases
This repository contains a data preprocessing pipeline for the Single Cell Atlas of Immune Diseases (SCAID) project. The pipeline automates the preprocessing of single-cell RNA-sequencing (scRNA-seq) data, from raw fastq files to a ready-to-analyze dataset, focusing on immune-related diseases.
- Creating Input Config file
- CellRanger Wrapper
- CellBender Wrapper
- CSP Demultiplexing
- SNP Demultiplexing
- Preprocessing
- Integration
SCAID aims to generate a comprehensive single-cell atlas for immune diseases, providing a detailed cellular landscape and facilitating novel insights into immune cell heterogeneity across diseases. The SCAID Data Preprocessing Pipeline Ver.1.0 standardizes and streamlines the data preprocessing for single-cell RNA sequencing datasets, supporting various chemistries and preprocessing techniques like ambient RNA removal and demultiplexing.
The pipeline supports multi-modal data types (e.g., 3' GEX, 5' GEX, ATAC, VDJ, and Multiome) and automates the process of data input, preprocessing, and integration for further downstream analysis.
The first step in the pipeline is generating a configuration file that defines your dataset’s structure. This file should be in CSV format and contain information on raw fastq files and the associated metadata for each sample.
The CSV should contain the following columns:
- Sample_ID: Unique identifier for each sample
- Path: Path to the fastq files
- Chemistry: Specifies the sequencing chemistry (e.g., 3' GEX, 5' GEX, VDJ, etc.)
- Additional_Metadata: Any additional relevant information
You can customize the fields based on your requirements.
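As a rough illustration of the layout described above, the sketch below writes a two-row example config and checks it. The sample values, the function names `write_example_config` and `validate_config`, and the strict header check are assumptions made for this example; adjust the expected header if you customize the fields.

```shell
#!/usr/bin/env bash
# Sketch only: all sample values below are placeholders.

# Write a small example config to the path given as $1.
write_example_config() {
  cat > "$1" <<'EOF'
Sample_ID,Path,Chemistry,Additional_Metadata
Sample_001,/path/to/fastq/Sample_001,3' GEX,batch1
Sample_002,/path/to/fastq/Sample_002,5' GEX,batch1
EOF
}

# Return 0 if the header matches the four columns listed above and
# every Sample_ID is non-empty and unique; print problems to stderr.
validate_config() {
  awk -F, '
    NR == 1 {
      if ($0 != "Sample_ID,Path,Chemistry,Additional_Metadata") {
        print "unexpected header: " $0 > "/dev/stderr"; bad = 1
      }
      next
    }
    $1 == ""   { print "line " NR ": empty Sample_ID" > "/dev/stderr"; bad = 1 }
    seen[$1]++ { print "line " NR ": duplicate Sample_ID " $1 > "/dev/stderr"; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}
```

For example, `write_example_config config_file.csv && validate_config config_file.csv` exits with status 0, while a file with a repeated Sample_ID makes `validate_config` exit non-zero.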
The CellRanger Wrapper (Wrapper_CellRanger.sh) script automates the execution of various CellRanger workflows for different types of single-cell data and sequencing chemistries. This script supports both single and multiplexed data, with options for SNP or hashtag-based demultiplexing. It handles multiple sequencing technologies such as 3' GEX, 5' GEX, VDJ, ATAC, and Multiome.
You can run the Wrapper_CellRanger.sh script with the following options:
bash Wrapper_CellRanger.sh \
--plex [Single | Multi_SNP | Multi_CSP] \
--chemistry [3 | 5 | ATAC | Multiome] \
--id [cellranger_id] \
--fastq_name [fastq prefix name] \
--fastq_path [directory path where fastqs are located] \
--config [config csv file]

Supported chemistries:
- 3' GEX
- 5' GEX
- VDJ
- ATAC
- Multiome
This ensures that the appropriate CellRanger commands and parameters are applied to each dataset, simplifying the management of large, complex data repositories.
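How the options might map onto CellRanger commands can be sketched as follows. This is not the code of Wrapper_CellRanger.sh: the function name `build_cellranger_cmd`, the `REF` variable, and the mapping itself (Single 3'/5' to `cellranger count`, ATAC to `cellranger-atac count`, Multiome to `cellranger-arc count`, multiplexed runs to `cellranger multi`) are assumptions for illustration. The function only prints the command it would run.

```shell
#!/usr/bin/env bash
# Illustrative dispatch only; the real wrapper may choose differently.
REF="${REF:-/path/to/reference}"   # hypothetical reference path

build_cellranger_cmd() {
  local plex="" chemistry="" id="" fastq_name="" fastq_path="" config=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --plex)       plex="$2";       shift 2 ;;
      --chemistry)  chemistry="$2";  shift 2 ;;
      --id)         id="$2";         shift 2 ;;
      --fastq_name) fastq_name="$2"; shift 2 ;;
      --fastq_path) fastq_path="$2"; shift 2 ;;
      --config)     config="$2";     shift 2 ;;
      *) echo "unknown option: $1" >&2; return 1 ;;
    esac
  done
  case "$plex" in
    Single)
      case "$chemistry" in
        3|5)      echo "cellranger count --id=$id --transcriptome=$REF --fastqs=$fastq_path --sample=$fastq_name" ;;
        ATAC)     echo "cellranger-atac count --id=$id --reference=$REF --fastqs=$fastq_path --sample=$fastq_name" ;;
        Multiome) echo "cellranger-arc count --id=$id --reference=$REF --libraries=$config" ;;
        *) echo "unsupported chemistry: $chemistry" >&2; return 1 ;;
      esac ;;
    Multi_SNP|Multi_CSP)
      # Multiplexed runs are assumed to be driven by the config CSV.
      echo "cellranger multi --id=$id --csv=$config" ;;
    *) echo "unsupported plex: $plex" >&2; return 1 ;;
  esac
}
```

Printing the command rather than executing it keeps the sketch safe to try without CellRanger installed.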
bash Wrapper_CellRanger.sh \
--plex Single \
--chemistry 3 \
--id Sample_001 \
--fastq_name Sample_001 \
--fastq_path /path/to/fastq

bash Wrapper_CellRanger.sh \
--plex Multi_SNP \
--chemistry 5 \
--id Sample_002 \
--config config_file.csv

bash Wrapper_CellRanger.sh \
--plex Multi_CSP \
--chemistry 5 \
--id Sample_003 \
--config config_file.csv

The CellBender Wrapper (Wrapper_CellBender.sh) automates the ambient RNA removal process using CellBender v0.3.0 for specific disease codes from the SCAID dataset. This script filters CellRanger output files based on a disease code and runs CellBender on the raw_feature_bc_matrix.h5 files, removing ambient RNA contamination to enhance the quality of downstream single-cell data analysis.
This script requires a disease code as input. It automatically identifies the relevant raw_feature_bc_matrix.h5 files, processes them with CellBender, and stores the cleaned data in the same directory.
bash Wrapper_CellBender.sh <disease_code>
- <disease_code>: The code associated with the disease or dataset of interest (e.g., "ILD"). This code is used to filter the files before running CellBender.
1. Activate the CellBender Environment: The script begins by activating the CellBender v0.3.0 environment. You may replace the default environment path with your own if necessary.
source activate cellbender
2. Find CellRanger Output Files: It searches for raw_feature_bc_matrix.h5 files in the specified CellRanger output directory (/02.CellRanger_output).
3. Filter by Disease Code: The script filters the list of raw_feature_bc_matrix.h5 files based on the specified disease code (from the 7th layer of the file path). Only files located in the "outs" directory are considered for processing.
4. Run CellBender: For each filtered file, the script checks whether CellBender has already been run (i.e., if a CellBender_Done file exists). If not, CellBender is executed to remove ambient RNA contamination using the following parameters:
- input: The path to the raw_feature_bc_matrix.h5 file.
- output: The cleaned file is saved as <cellranger_id>_cellbender_output.h5.
- cuda: Enables GPU acceleration for faster processing.
- fpr: Sets the false positive rate threshold for ambient RNA (default: the three values 0.01, 0.05, and 0.1).
- epochs: Number of training epochs (default: 150).
- low-count-threshold: Sets the threshold for low counts (default: 5).
- projected-ambient-count-threshold: Sets the projected ambient count threshold (default: 0.1; may need adjustment for scATAC-seq data).
5. Mark Completion: Once CellBender has successfully processed a sample, a CellBender_Done file is created in the sample directory to avoid redundant reprocessing.
The script prevents reprocessing by checking for the presence of a CellBender_Done file.
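The steps above can be sketched as a loop like the one below. It is a minimal sketch, not the contents of Wrapper_CellBender.sh: the names `run`, `disease_of`, `run_cellbender_for`, and the `DRY_RUN` switch are assumptions, and the output is assumed to be written next to the input file. With `DRY_RUN=1` (the default here) the CellBender command is only printed.

```shell
#!/usr/bin/env bash
# Sketch only; helper names and DRY_RUN are illustrative assumptions.

# Echo the command when DRY_RUN=1 (default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Print the Nth "/"-separated field of a path (7th by default).
disease_of() {
  printf '%s\n' "$1" | cut -d/ -f"${2:-7}"
}

# Usage: run_cellbender_for <disease_code> <cellranger_output_root> [layer]
run_cellbender_for() {
  local disease="$1" root="$2" layer="${3:-7}" h5 outs_dir sample_dir id
  # Only files inside an "outs" directory are considered.
  find "$root" -type f -path '*/outs/raw_feature_bc_matrix.h5' | sort |
  while IFS= read -r h5; do
    [ "$(disease_of "$h5" "$layer")" = "$disease" ] || continue
    outs_dir="$(dirname "$h5")"
    sample_dir="$(dirname "$outs_dir")"
    id="$(basename "$sample_dir")"
    if [ -e "$sample_dir/CellBender_Done" ]; then
      echo "skip $id (CellBender_Done found)"
      continue
    fi
    # </dev/null keeps the command from consuming the file list on stdin.
    if run cellbender remove-background \
         --input "$h5" \
         --output "$outs_dir/${id}_cellbender_output.h5" \
         --cuda \
         --fpr 0.01 0.05 0.1 \
         --epochs 150 \
         --low-count-threshold 5 \
         --projected-ambient-count-threshold 0.1 </dev/null; then
      # Mark completion only after a real, successful run.
      [ "${DRY_RUN:-1}" = "1" ] || touch "$sample_dir/CellBender_Done"
    fi
  done
}
```

Under these assumptions, `run_cellbender_for ILD /path/to/02.CellRanger_output` prints one command per unprocessed ILD sample; the optional third argument changes which path layer holds the disease code.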
The CSP Demultiplexing (Hashtag_demulti.sh) script is designed to automate the post-CellRanger multi-step demultiplexing process for hashtag-labeled (CSP) single-cell RNA-seq data. The script processes CellRanger outputs by extracting and formatting the necessary data for downstream analyses, and it converts BAM files into FASTQ files using bamtofastq.
This script processes all the sample outputs from a CellRanger multi-step run, extracts key metrics (such as the number of reads and cells), rounds up the number of reads, and runs bamtofastq on the corresponding BAM files.
bash Hashtag_demulti.sh -p <absolute_path> -f <folder_id>
- -p, --path: Specify the absolute path to the base directory where the CellRanger output folder is located.
- -f, --folder: Specify the folder identifier for the CellRanger run to process.
bash Hashtag_demulti.sh -p /mnt/gmi-l1/_90.User_Data/Shared_SCAID/02.CellRanger_output -f sample_001
- Activate CellRanger Tools: The script begins by sourcing the tools bundled with CellRanger to ensure bamtofastq and other required tools are available in the path.
  source /mnt/gmi-l1/_90.User_Data/Shared_SCAID/Programs/cellranger-7.2.0/sourceme.bash
- Round Up Function: The script includes a custom function to round up the number of reads extracted from the metrics file to the nearest significant digit. This ensures optimal splitting of FASTQ files during the bamtofastq step.
- Identify Subfolders: The script searches for all subfolders within the outs/per_sample_outs directory for each sample in the specified folder_id.
- Extract and Process Metrics: For each subfolder, the script extracts the number of reads and the number of cells from the metrics_summary.csv file. It then rounds up the number of reads and prepares to process the BAM files.
- Run bamtofastq: The bamtofastq tool is run on the BAM file for each subfolder, using the rounded number of reads to split the files appropriately.
  bamtofastq --reads-per-fastq=<rounded_number_of_reads> <input_bam_file> <output_fastq_directory>
- Create Per-Sample CSV: After converting the BAM to FASTQ, the script generates a configuration CSV file for each sample. This file includes:
  - Gene Expression library configuration.
  - VDJ library configuration.
  - A library list that points to the generated FASTQ files for each sample.
  The CSV is created in the base path and is used for downstream processing.
- Important Considerations: The script assumes that the first bamtofastq folder created for each sample is for Gene Expression (GEX). It is important to ensure that the GEX library comes before the CMO library in the original CellRanger config file.
For each sample processed, the script generates:
- A folder containing the FASTQ files split from the BAM file.
- A per-sample CSV file that specifies the libraries and references for the subsequent CellRanger or downstream analysis.
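The read-rounding and bamtofastq steps described in this section could look roughly like the sketch below. It reads "round up to the nearest significant digit" as rounding up to one leading digit (for example 1234567 becomes 2000000), which is one interpretation rather than the script's confirmed behavior. The function names `round_up` and `bamtofastq_cmd` are hypothetical, and the bamtofastq command is printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch only; function names are illustrative.

# Round a positive integer up to one leading significant digit.
# Thousands separators such as "12,345,678" are stripped first.
round_up() {
  local n pow i
  n="$(printf '%s' "$1" | tr -d ',')"
  pow=1
  i=1
  # pow becomes 10^(number_of_digits - 1).
  while [ "$i" -lt "${#n}" ]; do
    pow=$((pow * 10))
    i=$((i + 1))
  done
  echo $(( (n + pow - 1) / pow * pow ))
}

# Usage: bamtofastq_cmd <number_of_reads> <input_bam_file> <output_fastq_directory>
# Prints the bamtofastq call with the rounded read count.
bamtofastq_cmd() {
  local reads
  reads="$(round_up "$1")"
  echo "bamtofastq --reads-per-fastq=$reads $2 $3"
}
```

With this reading, `round_up 1234567` prints 2000000 and `round_up 5000000` prints 5000000, so a single FASTQ chunk is large enough to hold every read of the sample.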
- Custom BAM and FASTQ Processing: The script extracts the number of reads and cells from the CellRanger output metrics, rounds the reads, and splits the BAM files into FASTQ files for further analysis.
- CSV Creation: After processing, a CSV file is generated for each sample, specifying the reference data, the number of cells, and the libraries to be used in future analyses.
- Ensure Proper GEX and CMO Configuration: The script assumes that GEX appears before CMO in the original configuration file. This should be verified manually in cases where TCR/BCR information is included.
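A per-sample CSV of the kind described above might be generated along these lines. The layout follows the general cellranger multi config format (`[gene-expression]`, `[vdj]`, `[libraries]`), but the function name `write_sample_csv`, the reference paths, the use of `force-cells` to carry the cell count, and the VDJ library row are assumptions; the file actually written by Hashtag_demulti.sh may differ.

```shell
#!/usr/bin/env bash
# Sketch only; reference paths are hypothetical placeholders.
GEX_REF="${GEX_REF:-/path/to/gex_reference}"
VDJ_REF="${VDJ_REF:-/path/to/vdj_reference}"

# Usage: write_sample_csv <out_csv> <gex_fastq_dir> <n_cells> <vdj_fastq_id> <vdj_fastq_dir>
write_sample_csv() {
  local out_csv="$1" gex_dir="$2" n_cells="$3" vdj_id="$4" vdj_dir="$5"
  cat > "$out_csv" <<EOF
[gene-expression]
reference,${GEX_REF}
force-cells,${n_cells}

[vdj]
reference,${VDJ_REF}

[libraries]
fastq_id,fastqs,feature_types
bamtofastq,${gex_dir},Gene Expression
${vdj_id},${vdj_dir},VDJ
EOF
}
```

The `bamtofastq` fastq_id reflects the file prefix that bamtofastq gives its output FASTQs; the GEX directory passed in should be the first bamtofastq folder, in line with the GEX-before-CMO assumption noted above.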
The SNP Demultiplexing (souporcell.sh) script is designed to handle the SNP-based demultiplexing of multiplexed single-cell RNA-seq data. It uses the SoupOrCell algorithm to demultiplex cells based on genetic variation (SNPs) between individuals. This script processes CellRanger outputs and uses CellBender-filtered barcodes to accurately assign cells to their respective donors.
This script processes all multiplexed sample outputs from CellRanger, identifies SNP clusters, and assigns cells to individual donors. It requires CellBender-processed barcode files and corresponding BAM files.
bash souporcell.sh
This script iterates through a predefined list of sample-multiplex_k pairs, where:
- sample: The name of the sample processed by CellRanger.
- multiplex_k: The number of multiplexed individuals (donors) in the sample.
- Activate the SoupOrCell Environment: The script activates the souporcell environment to run the SoupOrCell pipeline.
  source activate souporcell
- Process CellRanger Outputs: For each sample, the script uses the BAM file generated by CellRanger and the cell barcodes processed by CellBender to perform SNP-based demultiplexing.
- Run SoupOrCell: The script runs SoupOrCell on the BAM and barcode files, specifying the number of donors (multiplex_k), and outputs the demultiplexing results, including cell assignments and SNP clusters.
- clusters.tsv: SNP-based clusters that identify cells belonging to different donors.
- genotype.vcf: SNP genotype information for each donor.
- assignment.tsv: A file linking each cell barcode to its respective donor.
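The iteration over sample-multiplex_k pairs could be sketched as below. The `sample:k` pair format, the `SAMPLES`, `CR_ROOT`, `FASTA` and `THREADS` variables, the BAM and barcode file locations, and the function name `souporcell_cmds` are all assumptions for illustration; only the general `souporcell_pipeline.py` options (`-i`, `-b`, `-f`, `-t`, `-o`, `-k`) come from SoupOrCell itself. The commands are printed, not run.

```shell
#!/usr/bin/env bash
# Sketch only; paths and the pair list are hypothetical placeholders.
SAMPLES="${SAMPLES:-sample_001:4 sample_002:6}"   # "sample:multiplex_k" pairs
CR_ROOT="${CR_ROOT:-/path/to/02.CellRanger_output}"
FASTA="${FASTA:-/path/to/genome.fa}"
THREADS="${THREADS:-8}"

souporcell_cmds() {
  local pair sample k outs
  for pair in $SAMPLES; do
    sample="${pair%%:*}"   # text before the first colon
    k="${pair##*:}"        # text after the last colon
    outs="$CR_ROOT/$sample/outs"
    echo "souporcell_pipeline.py" \
      "-i $outs/possorted_genome_bam.bam" \
      "-b $outs/${sample}_cellbender_output_cell_barcodes.csv" \
      "-f $FASTA -t $THREADS" \
      "-o $CR_ROOT/$sample/souporcell -k $k"
  done
}
```

Keeping the donor count next to the sample name in one list makes it harder to run a sample with the wrong multiplex_k.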
This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT), grant number 2022M3A9D3016848.
<img src="https://github.com/user-attachments/assets/a6eac0c1-1745-4f8b-84b7-6024374036d9" alt="Research Funding">

