Tools to add genomic positions to files that contain dbSNP IDs. The pipeline downloads dbSNP from the UCSC Genome Browser (UCSC/NCBI dbSNP mirrors), filters to the required columns, splits the reference for faster lookups, and then maps IDs in parallel. It supports dbSNP releases 151, 153, and 155 (default: 155).
- R packages: data.table, optparse, parallel, here
- Command line tools: split, gzip (optional: pigz for faster decompression), aria2c (optional, multi-connection downloads)
- Optional but recommended: UCSC bigBedNamedItems utility + dbSNP BigBed (dbSnp155.bb); defaults target hg38/dbSNP155
Create and activate the environment once:
mamba env create -f environment.yml
mamba activate mapdbsnp
If you update environment.yml later:
mamba env update -f environment.yml --prune
mamba activate mapdbsnp
The env includes aria2, pigz, R, and required R packages.
Linux/HPC users should also install bigBedNamedItems from Bioconda into the same env:
mamba activate mapdbsnp
mamba install -n mapdbsnp -c bioconda ucsc-bigbednameditems -y
which bigBedNamedItems
If you are not using mamba, install the R dependencies with (parallel ships with base R and does not need installing):
Rscript -e 'install.packages(c("data.table","optparse","here"))'
Install aria2c (optional, for faster downloads):
- Conda/Mamba: conda install -c conda-forge aria2 (or mamba install -c conda-forge aria2)
- Homebrew (macOS): brew install aria2
- Debian/Ubuntu: sudo apt-get install -y aria2
- RHEL/CentOS: sudo yum install -y aria2
Using the UCSC BigBed file skips the 90–100 GB text download and parallel awk scan.
BigBed mode streams the input in chunks and auto-sizes chunk length by input rows and workers (--chunk-size=0, default), targeting roughly one chunk per worker for lower overhead and predictable parallel fan-out. Chunk processing is parallelized, defaulting to --cpus workers.
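The auto-sizing rule for --chunk-size=0 can be pictured as a ceiling division of input rows by workers. This is only a sketch of that idea: the helper name auto_chunk_size is hypothetical, and the script's own rule may apply extra bounds that are not described here.

```shell
# Hypothetical helper: rows per chunk so that there is roughly one chunk per worker.
auto_chunk_size() {
  rows=$1
  workers=$2
  # Treat a missing or zero worker count as a single worker.
  if [ "${workers:-0}" -lt 1 ]; then workers=1; fi
  # Ceiling division: (rows + workers - 1) / workers.
  size=$(( (rows + workers - 1) / workers ))
  # Never return a chunk size below 1.
  if [ "$size" -lt 1 ]; then size=1; fi
  echo "$size"
}

# Example: one million input rows spread over 24 workers.
auto_chunk_size 1000000 24
```

With this rule a larger --cpus value yields smaller chunks, so the number of chunks tracks the number of workers.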
By default, non-primary contigs (hap/alt/random-like chromosome names) are excluded from output.
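One way to picture the default contig filter is a whitelist on the chromosome name. The sketch below assumes that "primary" means 1-22, X, Y, and M/MT with an optional chr prefix; the function name is_primary_contig and the exact pattern are illustrative, not the script's actual rule.

```shell
# Hypothetical check: succeeds only for primary chromosome names
# (1-22, X, Y, M/MT), with or without the "chr" prefix.
is_primary_contig() {
  printf '%s\n' "$1" | grep -Eq '^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|M|MT)$'
}

# Classify a few UCSC-style names.
for name in chr1 chrX chr6_GL000250v2_alt chrUn_KI270302v1 chr1_KI270706v1_random; do
  if is_primary_contig "$name"; then
    echo "$name keep"
  else
    echo "$name drop"
  fi
done
```

Passing --include-alt-chrom corresponds to skipping a check like this one.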
If an rsID maps to multiple primary positions, the first mapping (chromosome order 1-22,X,Y,MT, then lowest position) is kept in the main output, and the remaining mappings are written to <prefix>_multiPos_dbSNP<version>_<build>.txt.
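The tie-breaking order described above (chromosome order 1-22, X, Y, MT, then lowest position) can be sketched with awk and sort. The three-column input layout (rsID, chromosome, position) and the name pick_first_mapping are assumptions made for illustration; the script itself does this in R and also writes the remaining rows to the multiPos file.

```shell
# Hypothetical helper: reads tab-separated "rsID<TAB>chrom<TAB>pos" rows on stdin
# and prints one row per rsID, chosen by chromosome rank and then lowest position.
pick_first_mapping() {
  tab=$(printf '\t')
  awk -F'\t' 'BEGIN { OFS = "\t" }
  {
    c = $2
    sub(/^chr/, "", c)
    if (c == "X") rank = 23
    else if (c == "Y") rank = 24
    else if (c == "M" || c == "MT") rank = 25
    else if (c ~ /^[0-9]+$/) rank = c + 0
    else rank = 99            # anything else sorts last
    print rank, $3, $0        # prefix the two sort keys
  }' |
    sort -t "$tab" -k1,1n -k2,2n |
    cut -f3- |
    awk -F'\t' '!seen[$1]++'   # keep the first (best-ranked) row per rsID
}

# Made-up rows: rs1 has three candidate positions, rs2 has two.
printf 'rs1\tchrX\t100\nrs1\tchr2\t500\nrs1\tchr2\t300\nrs2\tchrM\t5\nrs2\tchrY\t9\n' |
  pick_first_mapping
```

In this made-up input, rs1 is kept at chr2:300 and rs2 at chrY:9. Note that the output is ordered by chromosome rank rather than by input order.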
Output coordinates are 1-based: POS is always the 1-based leftmost mapped position (including for indels). POS0 is the UCSC 0-based start and is emitted only in the multi-position diagnostic file.
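The relationship between the two coordinate columns can be illustrated on a BED-style record, where the second column (chromStart) is 0-based. The sketch assumes POS = POS0 + 1 and uses a hypothetical output layout (ID, chromosome, POS, POS0); the helper name bed_to_pos1 and the sample row are made up.

```shell
# Hypothetical helper: reads BED-like rows (chrom, chromStart, chromEnd, name)
# and prints name, chrom, 1-based POS, and 0-based POS0.
bed_to_pos1() {
  awk -F'\t' 'BEGIN { OFS = "\t" } { print $4, $1, $2 + 1, $2 }'
}

# Made-up record: 0-based start 99 becomes 1-based POS 100.
printf 'chr2\t99\t100\trsTEST\n' | bed_to_pos1
```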
Auxiliary outputs are build-specific: <prefix>_noMatch_dbSNP<version>_<build>.txt and <prefix>_multiPos_dbSNP<version>_<build>.txt.
During BigBed runs, the script prints stage timing and chunk progress (done/total, %, elapsed time, ETA) plus heartbeat updates while chunks are still running.
- Ensure bigBedNamedItems is available:
  - Linux/HPC (recommended with mamba):
    mamba activate mapdbsnp
    mamba install -n mapdbsnp -c bioconda ucsc-bigbednameditems -y
    which bigBedNamedItems
  - macOS (Apple Silicon, manual UCSC binary):
    curl -L http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.arm64/bigBedNamedItems -o ./script/bigBedNamedItems && chmod +x ./script/bigBedNamedItems
  - macOS (Intel, manual UCSC binary):
    curl -L http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/bigBedNamedItems -o ./script/bigBedNamedItems && chmod +x ./script/bigBedNamedItems
  - Linux (x86_64, manual UCSC binary):
    curl -L http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigBedNamedItems -o ./script/bigBedNamedItems && chmod +x ./script/bigBedNamedItems
If you are on an HPC (or older enterprise Linux) and see errors like GLIBC_2.29 not found / GLIBC_2.33 not found / GLIBC_2.34 not found, use the Bioconda build in the mapdbsnp env. The script resolves bigBedNamedItems from PATH first.
mamba activate mapdbsnp
mamba install -n mapdbsnp -c bioconda ucsc-bigbednameditems -y
which bigBedNamedItems
file "$(which bigBedNamedItems)"
ldd "$(which bigBedNamedItems)" | head
- Download the dbSNP BigBed for your build (defaults to ./data/dbSnp<version>_<build>.bb):
  - hg38 (default): http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp155.bb (or dbSnp153.bb / dbSnp151.bb if you set --dbsnp-version)
  - hg19: http://hgdownload.soe.ucsc.edu/gbdb/hg19/snp/dbSnp155.bb (or dbSnp153.bb / dbSnp151.bb)
  Make sure the file you download matches the --build you use. The script will auto-download to ./data if missing, unless --no-bb is set. Tip: if you have bandwidth, aria2c can speed this up:
  aria2c -x8 -s8 -o dbSnp155_hg38.bb http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp155.bb
  mv dbSnp155_hg38.bb ./data/
  If you do not have aria2c, use curl:
  curl -L http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp155.bb -o ./data/dbSnp155_hg38.bb
- Run with --bb-file:
mamba activate mapdbsnp
Rscript ./script/positionsFromDBSNP.r \
--input=./example/example_input.txt \
--ID=ID \
--build=hg38 \
--dbsnp-version=155 \
--cpus=24 \
--bb-file=./data/dbSnp155_hg38.bb \
--outdir=./example \
--prefix=example_bb \
--data-dir=./data
The script will still download RsMergeArch.bcp for ID updates if it is not present in ./data.
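The default BigBed location and the UCSC URLs listed in the steps above follow a fixed naming pattern, which the sketch below reconstructs from build and release. The helper names bb_url and bb_default_path are hypothetical, and the sketch only mirrors the patterns shown in this README; it does not check that a given file exists on the server.

```shell
# Hypothetical helper: UCSC gbdb URL for a build (hg19|hg38) and dbSNP release.
bb_url() {
  build=$1
  version=${2:-155}
  case "$build" in
    hg19|hg38) ;;
    *) echo "unsupported build: $build" >&2; return 1 ;;
  esac
  case "$version" in
    151|153|155) ;;
    *) echo "unsupported dbSNP version: $version" >&2; return 1 ;;
  esac
  echo "http://hgdownload.soe.ucsc.edu/gbdb/${build}/snp/dbSnp${version}.bb"
}

# Hypothetical helper: default local path, <data-dir>/dbSnp<version>_<build>.bb.
bb_default_path() {
  echo "${1}/dbSnp${3}_${2}.bb"
}

bb_url hg38 155
bb_default_path ./data hg38 155
```

Keeping the build in the local file name makes it harder to pair an hg19 file with --build=hg38 by accident.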
Use this only if you are not using the BigBed fast path. It downloads and splits the full text dumps.
Download and preprocess dbSNP once, then reuse across runs:
mamba activate mapdbsnp
Rscript ./script/prepare_reference_data.R \
--build=both \
--dbsnp-version=155 \
--data-dir=./data \
--cpus=8
This fetches dbSNP 155 for hg19 and hg38 (~90–100 GB total after splitting) and the RsMerge archive, storing everything under ./data. Use --build=hg19 or --build=hg38 to limit downloads, and --split-lines to adjust chunk size.
Note: downloads automatically prefer aria2c (multi-connection) when available; otherwise they fall back to download.file.
Warning: the text/awk path is legacy and slow for large inputs; prefer the BigBed fast path whenever possible.
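The downloader preference noted above (aria2c when available, otherwise a fallback) can be mimicked in the shell. This is a rough analogue, not the script's code: the R script falls back to download.file, whereas this sketch uses curl as a stand-in, and the names pick_downloader and download_cmd are hypothetical. download_cmd only prints the command it would run.

```shell
# Hypothetical helper: report which downloader is available, preferring aria2c.
pick_downloader() {
  if command -v aria2c >/dev/null 2>&1; then
    echo aria2c
  elif command -v curl >/dev/null 2>&1; then
    echo curl
  else
    echo none
  fi
}

# Hypothetical helper: print (do not run) the download command for URL $1 and destination $2.
download_cmd() {
  case "$(pick_downloader)" in
    aria2c) echo "aria2c -x8 -s8 -d $(dirname "$2") -o $(basename "$2") $1" ;;
    curl)   echo "curl -L $1 -o $2" ;;
    *)      echo "no downloader found" >&2; return 1 ;;
  esac
}

pick_downloader
```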
mamba activate mapdbsnp
Rscript ./script/positionsFromDBSNP.r [options]
Key options:
- --input: path to file with dbSNP IDs (e.g., summary statistics)
- --ID: column name containing dbSNP IDs (default: ID)
- --build: genome build: hg19 or hg38
- --dbsnp-version: dbSNP release to use (151, 153, or 155; default: 155)
- --bb-file: path to dbSNP BigBed file (if set, text-based lookup is skipped; relative names are also searched under --data-dir)
- --chunk-size: rows per chunk for BigBed streaming mode (0 = auto from row count and workers; default: 0)
- --bb-workers: optional override for BigBed chunk workers (0 means use --cpus; default: 0)
- --include-alt-chrom: include non-primary contigs in output (default: off)
- --no-bb: disable the BigBed fast path and force text lookup
- --data-dir: directory for reference data (default: ./data)
- --outdir: output directory
- --prefix: prefix for output file name (defaults to input filename)
- --cpus: CPUs to use for parallel lookups
- --skip: skip this many lines in the input file
- --prepare-only: download reference data and exit
mamba activate mapdbsnp
Rscript ./script/positionsFromDBSNP.r \
--input=./example/example_input.txt \
--ID=ID \
--build=hg38 \
--dbsnp-version=155 \
--outdir=./example \
--prefix=example \
--data-dir=./data \
--cpus=16

# Create environment once, then activate it in each new shell
mamba env create -f environment.yml
mamba activate mapdbsnp
# Get UCSC BigBed tool (Apple Silicon) and make executable
curl -L http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.arm64/bigBedNamedItems -o ./script/bigBedNamedItems
chmod +x ./script/bigBedNamedItems
# Download dbSNP BigBed (hg38/dbSNP155)
aria2c -x8 -s8 -o dbSnp155_hg38.bb http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp155.bb
mv dbSnp155_hg38.bb ./data/
# Run example
Rscript ./script/positionsFromDBSNP.r \
--input=./example/example_input.txt \
--ID=ID \
--build=hg38 \
--dbsnp-version=155 \
--bb-file=./data/dbSnp155_hg38.bb \
--outdir=./example \
--prefix=example_bb \
--data-dir=./data

# Create environment once, then activate it in each new shell
mamba env create -f environment.yml
mamba activate mapdbsnp
curl -L http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/bigBedNamedItems -o ./script/bigBedNamedItems
chmod +x ./script/bigBedNamedItems
aria2c -x8 -s8 -o dbSnp155_hg38.bb http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp155.bb
mv dbSnp155_hg38.bb ./data/
Rscript ./script/positionsFromDBSNP.r --input=./example/example_input.txt --ID=ID --build=hg38 --dbsnp-version=155 --bb-file=./data/dbSnp155_hg38.bb --outdir=./example --prefix=example_bb --data-dir=./data

# Create environment once, then activate it in each new shell
mamba env create -f environment.yml
mamba activate mapdbsnp
# Install BigBed utility in the environment and use it from PATH
mamba install -n mapdbsnp -c bioconda ucsc-bigbednameditems -y
which bigBedNamedItems
# Download dbSNP BigBed (hg38/dbSNP155)
aria2c -x8 -s8 -o dbSnp155_hg38.bb http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp155.bb
mv dbSnp155_hg38.bb ./data/
# Run example
Rscript ./script/positionsFromDBSNP.r \
--input=./example/example_input.txt \
--ID=ID \
--build=hg38 \
--dbsnp-version=155 \
--bb-file=./data/dbSnp155_hg38.bb \
--outdir=./example \
--prefix=example_bb \
--data-dir=./data

# Create environment once, then activate it in each new shell
mamba env create -f environment.yml
mamba activate mapdbsnp
# Install BigBed utility in the environment and use it from PATH
mamba install -n mapdbsnp -c bioconda ucsc-bigbednameditems -y
which bigBedNamedItems
file "$(which bigBedNamedItems)"
ldd "$(which bigBedNamedItems)" | head
# Download dbSNP BigBed (hg38/dbSNP155)
aria2c -x8 -s8 -o dbSnp155_hg38.bb http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp155.bb
mv dbSnp155_hg38.bb ./data/
# Run example
Rscript ./script/positionsFromDBSNP.r \
--input=./example/example_input.txt \
--ID=ID \
--build=hg38 \
--dbsnp-version=155 \
--bb-file=./data/dbSnp155_hg38.bb \
--outdir=./example \
--prefix=example_bb \
--data-dir=./data

- UCSC Genome Browser downloads (dbSNP tables): https://hgdownload.soe.ucsc.edu/goldenPath/
- UCSC gbdb BigBed sources for dbSNP: https://hgdownload.soe.ucsc.edu/gbdb/
- UCSC bigBedNamedItems utility: http://hgdownload.cse.ucsc.edu/admin/exe/
- NCBI dbSNP RsMerge archive: https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/organism_data/