Description
Hi,
I want to apply ancestral annotation to a dataset stored in VCF Zarr format. Following the documentation, one way to achieve this is to prepare the data myself and use the from_data function. I'm slightly confused about some of the options though, and I hope you can help me out.
IIUC, for each site I need to pass n_major, major_base, minor_base, outgroup_bases, and n_ingroups to from_data. My dataset consists of some 500 samples, and I set n_ingroups=10. Do I understand correctly that I need to do the subsampling for each site myself? What I'm doing is the following: for each site, I have genotype calls in a numpy array of shape (samples, ploidy), which I flatten. I then subsample the flattened calls down to length n_ingroups, and from this subsample I determine n_major, major_base, and minor_base. Therefore, no probabilistic sampling will occur later on, as this subsample will be treated as a fixed observation for this site. Is this the correct interpretation?
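To make the per-site procedure concrete, here is a minimal sketch using only numpy. It does not call from_data. The helper name `summarize_site`, the use of `-1` for missing calls, skipping sites with more than two alleles in the subsample, and returning `None` as the minor base at monomorphic sites are all assumptions for illustration, not something taken from the documentation.

```python
import numpy as np


def summarize_site(genotypes, alleles, n_ingroups, rng):
    """Subsample allele calls at one site and summarize them.

    genotypes : int array of shape (samples, ploidy) holding allele indices;
                -1 is ASSUMED to mark a missing call.
    alleles   : sequence of bases indexed by allele index, e.g. ("A", "G").
    Returns a dict with n_major, major_base, minor_base, or None if the site
    has too few called alleles or the subsample is not (at most) biallelic.
    """
    calls = np.asarray(genotypes).ravel()   # flatten (samples, ploidy)
    calls = calls[calls >= 0]               # drop missing calls
    if calls.size < n_ingroups:
        return None                         # not enough data at this site

    # Draw n_ingroups allele copies without replacement.
    sub = rng.choice(calls, size=n_ingroups, replace=False)
    counts = np.bincount(sub, minlength=len(alleles))
    if np.count_nonzero(counts) > 2:
        return None                         # ASSUMPTION: skip multi-allelic subsamples

    # Sort alleles by descending count; ties go to the lower index (REF first).
    order = np.argsort(-counts, kind="stable")
    major, minor = order[0], order[1]
    return {
        "n_major": int(counts[major]),
        "major_base": alleles[major],
        # ASSUMPTION: None marks a monomorphic subsample.
        "minor_base": alleles[minor] if counts[minor] > 0 else None,
    }


rng = np.random.default_rng(42)
gt = np.array([[0, 0], [0, 1], [1, 1], [0, 0], [0, -1]])
site = summarize_site(gt, ("A", "G"), n_ingroups=5, rng=rng)
```

Because the subsample is drawn once per site, the resulting counts are fixed afterwards, which is the behaviour I describe above.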
Another thought I had was to use information from multiple outgroup samples from the same species. Currently I'm selecting one individual as the outgroup sample, but I guess one could instead probabilistically sample an outgroup allele from a number of outgroup samples. That would reduce any bias introduced by having selected an outgroup individual with a heterozygous (or even homozygous ALT) call where all other individuals are homozygous REF. This was actually how I first interpreted how the outgroups should be defined (I admittedly read the docs poorly...), but I realized something was wrong when the optimization step failed to converge (I have 10 outgroup individuals).
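A rough sketch of the outgroup idea, again with plain numpy only: pool all called allele copies from the outgroup individuals of one species and draw a single base uniformly, so each allele is picked with probability equal to its frequency among the outgroup calls. The helper name `sample_outgroup_base` and the `-1` missing-call encoding are assumptions, and whether this is a sensible input for from_data is exactly what I'm asking.

```python
import numpy as np


def sample_outgroup_base(outgroup_genotypes, alleles, rng):
    """Draw one outgroup base at a site from several outgroup individuals.

    outgroup_genotypes : int array of shape (outgroup_samples, ploidy) with
                         allele indices; -1 is ASSUMED to mark a missing call.
    alleles            : sequence of bases indexed by allele index.
    Returns a base, or None if no outgroup individual has a called allele.
    """
    calls = np.asarray(outgroup_genotypes).ravel()
    calls = calls[calls >= 0]               # drop missing calls
    if calls.size == 0:
        return None
    # Uniform draw over allele copies = sampling by outgroup allele frequency.
    return alleles[int(rng.choice(calls))]


rng = np.random.default_rng(1)
# Nine homozygous REF individuals and one heterozygote.
outgroup_gt = np.array([[0, 0]] * 9 + [[0, 1]])
base = sample_outgroup_base(outgroup_gt, ("A", "G"), rng)
```

With these calls, the REF base would be drawn with probability 19/20, rather than the result depending entirely on whether the heterozygous individual happened to be chosen as the single outgroup sample.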
Any help would be much appreciated.
Cheers,
Per