How do I use from_data for ancestral annotation?

Hi,

I want to apply ancestral annotation to a dataset stored in [VCF Zarr format](https://github.com/sgkit-dev/vcf-zarr-spec/blob/main/vcf_zarr_spec.md). Following the documentation, one way to achieve this is to prepare the data myself and use the `from_data` function. I'm slightly confused about some of the options though, and I hope you can help me out.

IIUC, for each site I need to pass `n_major`, `major_base`, `minor_base`, `outgroup_bases`, and `n_ingroups` to `from_data`. My dataset consists of some 500 samples, and I set `n_ingroups=10`. Do I understand correctly that I need to do the subsampling for each site myself? What I'm doing is the following: for each site, I have genotype calls in a numpy array of shape `(samples, ploidy)`, which I flatten. I then subsample the flattened calls down to `n_ingroups` entries. From this subsample I determine `n_major`, `major_base`, and `minor_base`. As a consequence, no probabilistic sampling will occur later on, because this subsample will be treated as a fixed observation for this site. Is this the correct interpretation?
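To make the question concrete, here is a minimal sketch of the per-site subsampling described above. The function name `subsample_site`, its arguments, and the returned dictionary are purely illustrative and not part of the library's API; the sketch assumes genotypes are integer allele indices with negative values marking missing calls, that sites are at most biallelic, and (as my own choice) that `minor_base` is `None` when the subsample is monomorphic.

```python
import numpy as np


def subsample_site(genotypes, alleles, n_ingroups, rng):
    """Subsample one site down to n_ingroups allele calls.

    genotypes: integer array of shape (samples, ploidy) holding allele
        indices into `alleles`; negative values mark missing calls.
    alleles: sequence of bases for this site, e.g. ["A", "G"].
    Returns None if fewer than n_ingroups non-missing calls exist.
    """
    calls = np.asarray(genotypes).ravel()
    calls = calls[calls >= 0]  # drop missing calls
    if calls.size < n_ingroups:
        return None

    # Draw n_ingroups calls without replacement; this fixes the
    # observation for the site, so no further sampling happens later.
    sub = rng.choice(calls, size=n_ingroups, replace=False)

    counts = np.bincount(sub, minlength=len(alleles))
    order = np.argsort(-counts, kind="stable")  # most common allele first
    major = order[0]
    has_minor = len(alleles) > 1 and counts[order[1]] > 0

    return {
        "n_major": int(counts[major]),
        "major_base": alleles[major],
        # Assumption: a monomorphic subsample has no minor base.
        "minor_base": alleles[order[1]] if has_minor else None,
    }


rng = np.random.default_rng(42)
site = np.array([[0, 0], [0, 1], [0, 0]])
print(subsample_site(site, ["A", "G"], n_ingroups=6, rng=rng))
```

The per-site values collected this way would then be what I pass on as `n_major`, `major_base`, and `minor_base`.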

Another thought I had was to use information from multiple outgroup samples from the same species. Currently I'm selecting one individual as the outgroup sample, but I guess one could also probabilistically sample an outgroup allele from a number of outgroup samples. That would reduce any bias introduced by having selected an outgroup individual with a heterozygous (or even homozygous ALT) call where all other individuals are homozygous REF. This was actually how I first interpreted how the outgroups should be defined (I admittedly read the docs poorly...), but I realized something was wrong when the optimization step failed to converge (I have 10 outgroup individuals).
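The outgroup idea could look roughly like the sketch below: pool all non-missing calls from the outgroup individuals of one species and draw a single allele uniformly from that pool, so each allele is picked in proportion to its frequency among the outgroup calls. The helper `sample_outgroup_base` is hypothetical and not something the library provides; whether the result is an appropriate entry for `outgroup_bases` is exactly what I'm unsure about.

```python
import numpy as np


def sample_outgroup_base(genotypes, alleles, rng):
    """Draw one outgroup base from pooled calls of several individuals.

    genotypes: integer array of shape (outgroup_samples, ploidy) with
        allele indices into `alleles`; negative values mark missing calls.
    Returns None if every call at this site is missing.
    """
    calls = np.asarray(genotypes).ravel()
    calls = calls[calls >= 0]  # drop missing calls
    if calls.size == 0:
        return None
    # A uniform draw over pooled calls samples each allele with
    # probability equal to its frequency in the outgroup individuals.
    return alleles[int(rng.choice(calls))]


rng = np.random.default_rng(1)
# Nine homozygous REF individuals and one heterozygote.
outgroup = np.array([[0, 0]] * 9 + [[0, 1]])
print(sample_outgroup_base(outgroup, ["C", "T"], rng))
```

With one heterozygote among 10 diploid individuals, the ALT base would be drawn with probability 1/20 instead of being fixed by the choice of a single individual.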

Any help would be much appreciated.

Cheers,

Per
