Hi
First of all, thanks for developing TRTools! I have a question regarding imputed STR input files.
I have a Beagle (v5.4) imputed VCF file that looks like this:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##filedate=20230503
##source="beagle.22Jul22.46e.jar"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=DR2,Number=A,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2P(RR)] and true REF dose">
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + 2P(AA)]">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
| #CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | Sample1 |
|---|---|---|---|---|---|---|---|---|---|
| 18 | 136703 | STR_614109 | TCCGGCAAAAAAAAAAAAAAAA | TCCAGCAAAAAAAAAAAAAAAA,TCCGGAAAAAAAAAAAAAAAAA,TGCGGCAAAAAAAAAAAAAAAA,TCCGGCAAAAAAAAAAAAAAAAA,TCCGGAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAGAA,TCCGGAAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAAAGAA | . | PASS | DR2=0,0.35,0,0.28,0,0.97,0.33,0.93,0.45;AF=0,0.008,0,0.0082,0,0.0352,0.0044,0.2689,0.0045;IMP | GT:DS | 0\|8:0,0,0,0,0,0,0,1,0 |
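To make the DS field in a record like this concrete: the header declares it with Number=A, so it holds one estimated dose per ALT allele. The sketch below shows one way such doses could be turned into an expected summed allele length for a diploid sample. It is only an illustration under that assumption, not necessarily how associaTR computes length dosages, and the function names `parse_ds` and `expected_length_sum` are made up for this example.

```python
def parse_ds(fmt, sample):
    """Return the per-ALT doses from a sample column such as '0|1:1,0'.

    `fmt` is the FORMAT column (e.g. 'GT:DS') and `sample` is the
    matching sample column of the same record.
    """
    fields = dict(zip(fmt.split(":"), sample.split(":")))
    return [float(x) for x in fields["DS"].split(",")]


def expected_length_sum(ref, alts, ds):
    """Expected summed length (in bp) of the two alleles of a diploid sample.

    Assumes `ds` holds one dose per ALT allele, so the dose left over
    for the REF allele is 2 minus the sum of the ALT doses.
    """
    if len(alts) != len(ds):
        raise ValueError("DS must have one value per ALT allele")
    ref_dose = 2.0 - sum(ds)
    total = ref_dose * len(ref)
    for alt, dose in zip(alts, ds):
        total += dose * len(alt)
    return total


# Toy record (not taken from the file above): REF is 4 bp, ALTs are 6 bp and 2 bp.
doses = parse_ds("GT:DS", "0|1:1,0")
print(expected_length_sum("ACAC", ["ACACAC", "AC"], doses))
```

With a real record, `ref` and `alts` would come from the REF column and the comma-separated ALT column, and `fmt` and `sample` from the FORMAT and sample columns.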
The imputation reference panel is from: http://gymreklab.com/2018/03/05/snpstr_imputation.html that was described in the Saini et al. paper: https://www.nature.com/articles/s41467-018-06694-0
I typically split these STR alleles and perform biallelic STR GWAS on binary phenotypes of interest; however, I am also interested in a length-based STR GWAS. That is how I came across associaTR in TRTools, which seems to be capable of running a length-based GWAS. I ran the following test with chr18 STRs:
associaTR results.tsv input_chr18_fortesting.vcf.gz cases phenotypes_forTest_caseControl.npy --same-samples --beagle-dosages --vcftype hipstr
...however, this gave the following error:
Traceback (most recent call last):
File "/home/fahri/miniconda3/bin/associaTR", line 10, in <module>
sys.exit(run())
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/associaTR/associaTR.py", line 582, in run
main(args)
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/associaTR/associaTR.py", line 599, in main
perform_gwas(
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/associaTR/associaTR.py", line 449, in perform_gwas
perform_gwas_helper(
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/associaTR/associaTR.py", line 207, in perform_gwas_helper
extra_detail_fields = next(genotype_iter)
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/associaTR/load_and_filter_genotypes.py", line 117, in load_trs
inferred_vcftype = trh.InferVCFType(vcf, vcftype if vcftype else 'auto')
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/utils/tr_harmonizer.py", line 209, in InferVCFType
raise TypeError('Could not identify the type of this vcf')
TypeError: Could not identify the type of this vcf.
Of note, I set --vcftype to hipstr above, as the Saini et al. paper seems to use HipSTR.
Then I noticed this section of the documentation regarding Beagle: https://trtools.readthedocs.io/en/stable/CALLERS.html#beagle and used trtools_prep_beagle_vcf.sh to convert my imputed VCF into a VCF that can be used in associaTR:
trtools_prep_beagle_vcf.sh hipstr 1kg.snp.str.chr18.vcf.gz input_chr18_fortesting.vcf.gz converted_input_chr18_fortesting.vcf.gz
...where 1kg.snp.str.chr18.vcf.gz was the reference panel downloaded from http://gymreklab.com/2018/03/05/snpstr_imputation.html, as mentioned above. I got another error here:
Creating temporary file ... ~/tmp/tmp.ULy8Ry1uQ1.vcf
Copying over meta header lines from the reference panel, then copying the imputed file contents
bgzipping and tabix indexing
Adding INFO fields and values from the ref_panel, then removing loci which are missing required INFO fields
[W::hts_idx_load3] The index file is older than the data file: ~/1kg.snp.str.chr18.vcf.gz.tbi
The INFO tag "START" is not defined in ~/1kg.snp.str.chr18.vcf.gz, was the -h option provided?
Failed to read from standard input: unknown file type
Is this error caused by the chr18 STR imputation reference file* lacking a "START" INFO tag (I am not sure what that tag is), by whatever triggers the "unknown file type" message (maybe my file is not in HipSTR format?), or by something else?
Can you please help with this error? Thank you very much in advance!
Best,
Fahri
*PS: This is what the imputation reference file header and a randomly chosen STR look like:
##fileformat=VCFv4.2
#filedate=20180225
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
##contig=<ID=18>
| #CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | HG00096 |
|---|---|---|---|---|---|---|---|---|---|
| 18 | 147475 | STR_614123 | ATACAAAAAAAAAAAAAAA | ATACAAAAAAAAAAAAA,AAACAAAAAAAAAAAAAA,ATACAAAAAAAAAAAAAA,AAACAAAAAAAAAAAAAAA,ATACAAAAAAAAAAAAAAAA,ATACAAAAAAAAAAAAAAAAA | . | PASS | AR2=0.89;DR2=0.91;AF=0.016,0.0019,0.33,0.0025,0.0057,0.004;IMP | GT:DS | 3\|0:0,0,0.99,0,0,0 |
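Whether a given INFO tag is declared can be checked directly against a header like the one above. The following is a minimal sketch that uses only the Python standard library. The helper names (`read_vcf_header`, `declared_info_tags`, `missing_info_tags`) are hypothetical, and START is the only tag checked because it is the only one the error message names. Any other tags the script may require would be an assumption.

```python
import gzip
import re

# Matches the ID of a '##INFO=<ID=...,' meta line.
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def read_vcf_header(path):
    """Collect the leading '#' lines of a gzip/bgzip-compressed VCF."""
    lines = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break
            lines.append(line.rstrip("\n"))
    return lines


def declared_info_tags(header_lines):
    """Return the set of INFO IDs declared in the given header lines."""
    tags = set()
    for line in header_lines:
        match = INFO_ID.match(line.strip())
        if match:
            tags.add(match.group(1))
    return tags


def missing_info_tags(header_lines, required):
    """Return the required INFO IDs that the header does not declare."""
    declared = declared_info_tags(header_lines)
    return [tag for tag in required if tag not in declared]


# Header lines copied from the reference panel excerpt above.
header = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">',
    '##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose">',
    '##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">',
    '##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">',
    '##contig=<ID=18>',
]

print(missing_info_tags(header, ["START"]))
```

For the actual file, `read_vcf_header("1kg.snp.str.chr18.vcf.gz")` could supply the header lines instead of the hard-coded list.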