Skip to content

SWAMPy method

Rabia Fidan edited this page May 29, 2024 · 13 revisions

How it works

1. Read input files

Some input files are expected to be provided by the user while some are expected to be chosen from a predefined set.

User-given

  • genome_file: a multi-FASTA containing (at least) all of the genomes that will be present in the simulated mixture (and possibly others that will not).
  • genome_abundances: A tab-separated file containing relative abundances of the genomes in the mixture. Source genomes that are in the genomes_file but not in the genome_abundances file are ignored and will not be present in the simulated data. See CLI arguments, see example/abundance.tsv.
  • temp_folder A place to create subfolders for genomes, amplicons and indices: intermediate files are stored in these directories. --autoremove cleans them after execution. Use unique --temp_folders while running simultaneous instances of SWAMPy.
    • genomes: individual FASTA files of the genomes in the genome_file multi FASTA.
    • amplicons: individual FASTA files of all amplicons of the source genomes.
    • indices: bowtie2 index files of the genomes.

User-chosen input files. Primer set-specific files. Already provided within SWAMPy, alternatively custom files can be provided by the user

The corresponding set of input files are automatically defined with respect to the --primer_set choice of the user, alternatively custom files can be provided by the user.

  • primers FASTQ: a FASTQ file containing all the primers used. See e.g. primer_sets/artic_v4_primers.fastq. The format of each read:

      @name_{amplicon number}_{LEFT|RIGHT}
      {primer sequence}
      +
      {dummy quality scores (ignored)} 
    
  • primer BED: - a 6-column BED file of the primers, where positions are with respect to the Wuhan reference. The last 2 columns are ignored. Format:

      MN908947.3	start	end	nCoV-2019_1_LEFT	1	+
    
  • amplicon_distribution.tsv: a file containing a vector for the Dirchlet distribution governing the amplicon distribution of a primer set. See e.g. primer_sets/artic_v4_amplicon_distribution.tsv. Format:

     amplicon_number \t hyperparameter
    

Note that for the Dirichlet distributions, it is expected that the sum of the hyperparameter column is 1.

  • other options: (e.g. nreads) are command-line options

2. Create amplicon population

  • The set of primers are aligned with each genome, using bowtie2
  • The output sam file is parsed, and the matching positions and primer lengths, and primer names, are stored in a pandas dataframe.
  • The above dataframe is separated out into two subframes, LEFT and RIGHT. These subframes are then joined based on amplicon number.
  • At this stage, any primers that couldn't be aligned to the genome are dropped from the dataframe.
  • The remaining amplicons have good matches for both their left and right primers - these are extracted from the genomes and stored in fasta files.
  • The dataframe is appended to a 'main' table containing summary information about all amplicons from all genomes processed so far.

3. Simulating numbers of reads per amplicon

  • The amplicon_distribution.csv file is read, the hyperparameter vector (vector a) is stored.
  • Vector a is expected at this stage to satisfy sum(a) == 1.
  • Next, a is scaled by the amplicon_pseudocounts, c: a = a*c
  • Sample an amplicon abundance profile using p = Dir(a), Dirichlet distribution.
  • Each amplicon in the amplicon population dataframe is then given a number of reads by drawing from Multinomial(N_genome, p), where N_genome = int(NREADS * genome_proportion).
  • All of the above data is adjoined to the amplicon dataframe from step 2.

4. High-Frequency error introduction

There are two main components of high-frequency errors, recurrent and unique, and both have three types of errors (substitution, insertion and deletion). Recurrent errors can be seen across all source genomes in the mixture while unique errors are seen in only one of the genomes. PCR errors are expected to be more in line with unique errors while real data also shows some recurrent errors which might originate from RNA degradation.

  • A dataframe which contains error type, position, corresponding amplicon number, alternative allele, recurrence state, VAF and length information of each error is created.
    • Counts of each type of errors to be introduced are drawn from a Poisson distribution with mean (error rate * genome length).
    • VAF of each error is drawn from a Dirichlet distribution: (alt,ref)=Dir(alpha1,alpha2). Default alpha values were estimated from real data.
    • Position of the error within the genome is randomly sampled without replacement.
    • Insertion lengths are drawn from a uniform distribution. Default max insertion length is 14, as observed in real data.
    • Deletion lengths are drawn from a geometric distribution. Default p values were estimated from real data.
  • Amplicons from step 3 are aligned to the Wuhan reference.
  • Mutants of the amplicons are created using different combinations of errors that are on the same amplicon.
  • The read count of the mutants are determined with a binomial distribution, whose p is the VAF of the error.
  • The amplicon dataframe is updated so that it now contains the PCR mutants and new read counts.

5. Grouping

For computational efficiency, art_illumina will be run once for each unique resulting read count (for example, one art_illumina run for all amplicons that have a read count of 2; we refer to this as 'read count group 2'). Therefore, FASTA files of amplicons in the amplicon population that have the same number of read counts are merged.

6. art_illumina

For each read count group of amplicons, art_illumina is run to generate the required numbers of reads. At present we simulate with parameters appropriate for Illumina MiSeq v3, paired-end, 250bp, using the following parameters:

    art_illumina \
        --amplicon \
        --quiet \
        --paired \
        --rndSeed random_number \
        --noALN \
        --maskN 0  \
        --seqSys seqSys(default MSv3) \
        --in amplicon_file.fasta \
        --len 250 \
        --rcount n_reads \
        --out out_prefix    

All of these output files are then stitched together and shuffled into a random order by a randomly generated permutation.

7. Clean-up

All of the temporary files (genomes, amplicons, bowtie indices, and art output files) are deleted.

Clone this wiki locally